Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning


1 Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic MM Algorithms 1/27

2 Statistical modeling with regularized risk minimization Given some data points x_i, i = 1,...,n, learn some model parameters θ in R^p by minimizing $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} \ell(x_i,\theta) + \lambda\psi(\theta)$, where ℓ measures the data fit, and ψ is a regularizer. The goal of this work is to deal with large n for relatively non-standard settings (non-convex, non-smooth, stochastic). Julien Mairal Incremental and Stochastic MM Algorithms 2/27

3 A simple (naive) optimization principle [Figure: a surrogate g(θ) majorizing f(θ), with equality at the current point κ.] Objective: $\min_{\theta \in \Theta} f(\theta)$. Principle called Majorization-Minimization [Lange et al., 2000]; quite popular in statistics and signal processing. Julien Mairal Incremental and Stochastic MM Algorithms 3/27

4 In this work [Figure: a surrogate g(θ) majorizing f(θ), with equality at the current point κ.] scalable Majorization-Minimization algorithms; for convex or non-convex and smooth or non-smooth problems. References: J. Mairal. Optimization with First-Order Surrogate Functions. ICML 13; J. Mairal. Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization. NIPS 13. Julien Mairal Incremental and Stochastic MM Algorithms 4/27

5 In this work Methodology: extend the MM principle to a large variety of settings; compute convergence rates for convex problems; show stationary point conditions for non-convex ones. First direction, incremental optimization: minimizes $\frac{1}{n}\sum_{i=1}^{n} f_i(\theta)$; requires some memory about past iterates; fast convergence rate for several passes over the data. Second direction, stochastic optimization: no memory about past iterates; minimizes $\mathbb{E}_x[f(\theta,x)]$. Julien Mairal Incremental and Stochastic MM Algorithms 5/27

6 Related work Incremental approaches for convex optimization: stochastic average gradient [Schmidt, Roux, and Bach, 2013]; stochastic dual coordinate ascent [Shalev-Shwartz and Zhang, 2012]. Stochastic optimization: stochastic proximal methods, e.g., [Duchi and Singer, 2009, Atchade et al., 2014]; literature about stochastic gradient descent, see, e.g., [Nemirovski et al., 2009]. Non-convex optimization: DC programming, see, e.g., [Gasso et al., 2009]; online EM [Neal and Hinton, 1998, Cappé and Moulines, 2009]. Julien Mairal Incremental and Stochastic MM Algorithms 6/27

7 Setting: First-Order Surrogate Functions [Figure: a surrogate g(θ) majorizing f(θ) near κ, with approximation error h(θ).] g(θ') ≥ f(θ') for all θ' in $\arg\min_{\theta \in \Theta} g(\theta)$; the approximation error h = g − f is differentiable, and ∇h is L-Lipschitz. Moreover, h(κ) = 0 and ∇h(κ) = 0; we sometimes assume g to be strongly convex. Julien Mairal Incremental and Stochastic MM Algorithms 7/27
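To make these conditions concrete, here is a small numerical check (not part of the original slides) for the one-dimensional logistic loss f(θ) = log(1 + e^{−θ}) and the quadratic surrogate g(θ) = f(κ) + f'(κ)(θ − κ) + (L/2)(θ − κ)² with L = 1/4, a valid bound since f'' ≤ 1/4; the point κ, the grid, and the tolerances are arbitrary choices.

```python
import numpy as np

# Numerical check of the first-order surrogate conditions for the 1-D logistic
# loss f(theta) = log(1 + exp(-theta)), whose second derivative is at most 1/4,
# and the quadratic surrogate g built at an (arbitrary) point kappa.
f = lambda t: np.log1p(np.exp(-t))
f_prime = lambda t: -1.0 / (1.0 + np.exp(t))
L = 0.25                                       # valid Lipschitz constant for f'

kappa = 0.7                                    # assumed current iterate
g = lambda t: f(kappa) + f_prime(kappa) * (t - kappa) + 0.5 * L * (t - kappa) ** 2
h = lambda t: g(t) - f(t)                      # approximation error h = g - f

grid = np.linspace(-10.0, 10.0, 2001)
assert np.all(h(grid) >= -1e-12)               # g majorizes f (so g >= f at its minimizer)
assert abs(h(kappa)) < 1e-12                   # h(kappa) = 0
eps = 1e-6
assert abs(h(kappa + eps) - h(kappa - eps)) / (2 * eps) < 1e-6   # grad h(kappa) ~ 0
print("surrogate conditions hold at kappa =", kappa)
```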

8 The Basic MM Algorithm
Algorithm 1 Basic Majorization-Minimization Scheme
1: Input: θ_0 ∈ Θ (initial estimate); T (number of iterations).
2: for t = 1,...,T do
3: Compute a surrogate g_t of f near θ_{t−1};
4: Minimize g_t and update the solution: $\theta_t \in \arg\min_{\theta \in \Theta} g_t(\theta)$.
5: end for
6: Output: θ_T (final estimate).
Julien Mairal Incremental and Stochastic MM Algorithms 8/27
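A minimal Python sketch of this scheme (an illustration, not the author's implementation): the caller supplies the routine that minimizes the surrogate built at the current iterate, and the toy quadratic at the end is assumed data showing that, with the Lipschitz gradient surrogate of the next slide, the loop reduces to gradient descent with step 1/L.

```python
import numpy as np

def mm(theta0, surrogate_argmin, T):
    """Basic majorization-minimization loop: repeatedly minimize a surrogate
    built near the current iterate.  `surrogate_argmin(kappa)` is assumed to
    return argmin over Theta of a first-order surrogate of f built near kappa."""
    theta = theta0
    for _ in range(T):
        theta = surrogate_argmin(theta)        # steps 3-4 of Algorithm 1
    return theta

# With the Lipschitz gradient surrogate, the surrogate minimizer is a gradient
# step, so MM reduces to gradient descent with step size 1/L
# (toy strongly convex quadratic, assumed for illustration only).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda th: A @ th - b                 # f(theta) = 0.5 theta'A theta - b'theta
L = np.linalg.eigvalsh(A).max()                # Lipschitz constant of grad f
theta_hat = mm(np.zeros(2), lambda kappa: kappa - grad_f(kappa) / L, T=200)
print(np.allclose(theta_hat, np.linalg.solve(A, b)))   # True
```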

9 Examples of First-Order Surrogate Functions Lipschitz Gradient Surrogates: f is L-smooth (differentiable + L-Lipschitz gradient). $g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top(\theta-\kappa) + \frac{L}{2}\|\theta-\kappa\|_2^2$. Minimizing g yields a gradient descent step $\theta \leftarrow \kappa - \frac{1}{L}\nabla f(\kappa)$. Julien Mairal Incremental and Stochastic MM Algorithms 9/27

10 Examples of First-Order Surrogate Functions Lipschitz Gradient Surrogates: f is L-smooth (differentiable + L-Lipschitz gradient). $g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top(\theta-\kappa) + \frac{L}{2}\|\theta-\kappa\|_2^2$. Minimizing g yields a gradient descent step $\theta \leftarrow \kappa - \frac{1}{L}\nabla f(\kappa)$. Proximal Gradient Surrogates: f = f_1 + f_2 with f_1 smooth. $g: \theta \mapsto f_1(\kappa) + \nabla f_1(\kappa)^\top(\theta-\kappa) + \frac{L}{2}\|\theta-\kappa\|_2^2 + f_2(\theta)$. Minimizing g amounts to one step of the forward-backward, ISTA, or proximal gradient descent algorithm [Beck and Teboulle, 2009, Combettes and Pesquet, 2010, Wright et al., 2008, Nesterov, 2007]. Julien Mairal Incremental and Stochastic MM Algorithms 9/27
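A short sketch of that step (assumed example, not from the slides): with f₂ = λ‖·‖₁, minimizing the proximal gradient surrogate amounts to a gradient step on f₁ followed by soft-thresholding, i.e. one ISTA iteration; the least-squares data below are synthetic.

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista_step(theta, grad_f1, L, lam):
    """Minimize the proximal gradient surrogate built at `theta` for
    f = f1 + lam*||.||_1: a gradient step on f1 followed by soft-thresholding."""
    return soft_threshold(theta - grad_f1(theta) / L, lam / L)

# Toy lasso problem with f1(theta) = 0.5 ||X theta - y||_2^2 (synthetic data).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad_f1 = lambda th: X.T @ (X @ th - y)
L = np.linalg.eigvalsh(X.T @ X).max()          # Lipschitz constant of grad f1
theta = np.zeros(5)
for _ in range(500):
    theta = ista_step(theta, grad_f1, L, lam=1.0)
print(theta)                                   # sparse estimate
```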

11 Examples of First-Order Surrogate Functions Linearizing Concave Functions and DC-Programming: f = f_1 + f_2 with f_2 smooth and concave. $g: \theta \mapsto f_1(\theta) + f_2(\kappa) + \nabla f_2(\kappa)^\top(\theta-\kappa)$. When f_1 is convex, the algorithm is called DC-programming. Julien Mairal Incremental and Stochastic MM Algorithms 10/27

12 Examples of First-Order Surrogate Functions Linearizing Concave Functions and DC-Programming: f = f_1 + f_2 with f_2 smooth and concave. $g: \theta \mapsto f_1(\theta) + f_2(\kappa) + \nabla f_2(\kappa)^\top(\theta-\kappa)$. When f_1 is convex, the algorithm is called DC-programming. Quadratic Surrogates: f is twice differentiable, and H is a uniform upper bound of $\nabla^2 f$: $g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top(\theta-\kappa) + \frac{1}{2}(\theta-\kappa)^\top H (\theta-\kappa)$. Actually a big deal in statistics and machine learning [Böhning and Lindsay, 1988, Khan et al., 2010, Jebara and Choromanska, 2012]. Julien Mairal Incremental and Stochastic MM Algorithms 10/27
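For instance, for ℓ2-logistic regression the classical Böhning-Lindsay bound $\nabla^2 f \preceq H = \frac{1}{4n}X^\top X + \lambda I$ gives such a quadratic surrogate; the sketch below (synthetic data, not from the talk) iterates the resulting MM update θ ← θ − H⁻¹∇f(θ) with H fixed and inverted once.

```python
import numpy as np

# Quadratic-surrogate MM for l2-regularized logistic regression, using the
# classical Boehning-Lindsay bound: Hessian <= H = (1/(4n)) X'X + lam*I.
# Synthetic data; H is fixed, so its inverse is computed once.
rng = np.random.default_rng(0)
n, p, lam = 200, 10, 0.1
X = rng.standard_normal((n, p))
y = np.sign(X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n))

def grad(theta):
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))  # sigma(-y_i x_i'theta)
    return -(X.T @ (y * s)) / n + lam * theta

H_inv = np.linalg.inv(X.T @ X / (4.0 * n) + lam * np.eye(p))

theta = np.zeros(p)
for _ in range(100):
    theta = theta - H_inv @ grad(theta)        # minimize the quadratic surrogate at theta
print(np.linalg.norm(grad(theta)))             # close to zero at the optimum
```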

13 Theoretical Guarantees When using first-order surrogates, for convex problems: $f(\theta_t) - f^\star = O(1/t)$; for µ-strongly convex ones: $O((1-\mu/L)^t)$; for non-convex problems: f(θ_t) monotonically decreases and $\liminf_{t \to +\infty} \inf_{\theta \in \Theta} \frac{\nabla f(\theta_t, \theta-\theta_t)}{\|\theta-\theta_t\|_2} \geq 0$, (1) which we call asymptotic stationary point condition. Directional derivative: $\nabla f(\theta,\kappa) = \lim_{\varepsilon \to 0^+} \frac{f(\theta+\varepsilon\kappa) - f(\theta)}{\varepsilon}$. When in addition Θ = R^p, (1) is equivalent to $\nabla f(\theta_t) \to 0$. Julien Mairal Incremental and Stochastic MM Algorithms 11/27

14 Incremental Optimization: MISO Suppose that f splits into many components: $f(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta)$. Recipe: incrementally update an approximate surrogate $\frac{1}{n}\sum_{i=1}^{n} g^i$; add some heuristics for practical implementations. Related work for convex problems: related to SAG [Schmidt et al., 2013] and SDCA [Shalev-Shwartz and Zhang, 2012], but offers different update rules. Julien Mairal Incremental and Stochastic MM Algorithms 12/27

15 Incremental Optimization: MISO
Algorithm 2 Incremental Scheme MISO
1: Input: θ_0 ∈ Θ; T (number of iterations).
2: Choose surrogates $g_0^i$ of $f_i$ near θ_0 for all i;
3: for t = 1,...,T do
4: Randomly pick one index $\hat{\imath}_t$ and choose a surrogate $g_t^{\hat{\imath}_t}$ of $f_{\hat{\imath}_t}$ near θ_{t−1}. Set $g_t^i = g_{t-1}^i$ for $i \neq \hat{\imath}_t$;
5: Update the solution: $\theta_t \in \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} g_t^i(\theta)$.
6: end for
7: Output: θ_T (final estimate).
Julien Mairal Incremental and Stochastic MM Algorithms 13/27

16 Incremental Optimization: MISO Update rule with Lipschitz gradient surrogates. We want to minimize $\frac{1}{n}\sum_{i=1}^{n} f_i(\theta)$. $\theta_t = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n}\left[ f_i(\kappa^i) + \nabla f_i(\kappa^i)^\top(\theta-\kappa^i) + \frac{L}{2}\|\theta-\kappa^i\|_2^2 \right] = \frac{1}{n}\sum_{i=1}^{n}\kappa^i - \frac{1}{Ln}\sum_{i=1}^{n}\nabla f_i(\kappa^i)$. At iteration t, randomly draw one index $\hat{\imath}_t$, and update $\kappa^{\hat{\imath}_t} \leftarrow \theta_t$. Remarks: replacing $\frac{1}{n}\sum_{i=1}^{n}\kappa^i$ by θ_{t−1} yields SAG [Schmidt et al., 2013]; replacing 1/L by 1/µ is almost identical to SDCA [Shalev-Shwartz and Zhang, 2012]. Julien Mairal Incremental and Stochastic MM Algorithms 14/27
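A Python sketch of this update (an illustration under assumptions, not the reference implementation): one anchor point κ_i is stored per function together with the two running averages appearing above, so each iteration costs O(p) plus one gradient evaluation; the least-squares components at the end are synthetic.

```python
import numpy as np

def miso_smooth(grads, theta0, L, T, rng):
    """Sketch of MISO with Lipschitz gradient surrogates for (1/n) sum_i f_i(theta):
    keep one anchor point kappa_i per function and the running averages entering
    theta_t = (1/n) sum_i kappa_i - (1/(L n)) sum_i grad f_i(kappa_i)."""
    n = len(grads)
    kappa = np.tile(theta0, (n, 1))                       # anchor points kappa_i
    g = np.array([grads[i](theta0) for i in range(n)])    # stored gradients
    kappa_bar, g_bar = kappa.mean(axis=0), g.mean(axis=0)
    theta = kappa_bar - g_bar / L
    for _ in range(T):
        i = rng.integers(n)                               # draw one index at random
        new_kappa, new_g = theta.copy(), grads[i](theta)
        kappa_bar = kappa_bar + (new_kappa - kappa[i]) / n   # O(p) update of the averages
        g_bar = g_bar + (new_g - g[i]) / n
        kappa[i], g[i] = new_kappa, new_g
        theta = kappa_bar - g_bar / L
    return theta

# Toy usage: f_i(theta) = 0.5 (a_i'theta - b_i)^2, so grad f_i is a_i (a_i'theta - b_i).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 5)), rng.standard_normal(50)
grads = [lambda th, a=A[i], bi=b[i]: a * (a @ th - bi) for i in range(50)]
L = np.max(np.sum(A ** 2, axis=1))                        # each grad f_i is ||a_i||^2-Lipschitz
theta = miso_smooth(grads, np.zeros(5), L, T=2000, rng=rng)
print(theta)
```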

17 Incremental Optimization: MISO Update rule for proximal gradient surrogates. We want to minimize $\frac{1}{n}\sum_{i=1}^{n} f_i(\theta) + \psi(\theta)$. $\theta_t = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n}\left[ f_i(\kappa_t^i) + \nabla f_i(\kappa_t^i)^\top(\theta-\kappa_t^i) + \frac{L}{2}\|\theta-\kappa_t^i\|_2^2 \right] + \psi(\theta) = \arg\min_{\theta \in \Theta} \frac{1}{2}\left\|\theta - \left(\frac{1}{n}\sum_{i=1}^{n}\kappa_t^i - \frac{1}{Ln}\sum_{i=1}^{n}\nabla f_i(\kappa_t^i)\right)\right\|_2^2 + \frac{1}{L}\psi(\theta)$. Julien Mairal Incremental and Stochastic MM Algorithms 15/27

18 Incremental Optimization: MISO Theoretical Guarantees: for non-convex problems, the guarantees are the same as the generic MM algorithm, with probability one; for convex problems and proximal gradient surrogates, the expected convergence rate with averaging becomes O(n/t); for µ-strongly convex problems and proximal gradient surrogates, the expected convergence rate is linear, $O((1-\mu/(nL))^t)$. Remarks for µ-strongly convex problems: the rates of SDCA and SAG in this setting are better: µ/(nL) is replaced by O(min(µ/L, 1/n)); the MM principle is too conservative. For smooth problems, we can match these rates by using minorizing surrogates [Mairal, 2014]. Julien Mairal Incremental and Stochastic MM Algorithms 16/27

19 Incremental Optimization: MISO Example for ℓ2-logistic regression: $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n}\log(1+e^{-y_i \theta^\top x_i}) + \frac{\lambda}{2}\|\theta\|_2^2$. The problem is λ-strongly convex. Table: Description of datasets used in our experiments: alpha (dense), ocr (dense), rcv1 (sparse), webspam (sparse). Julien Mairal Incremental and Stochastic MM Algorithms 17/27

20 Incremental Optimization: MISO [Figure: relative duality gap versus effective passes over the data on the alpha, ocr, rcv1, and webspam datasets, comparing SGD, FISTA-LS, SDCA, SAG, and MISO variants (MISO0, MISO2, MISOµ).] Julien Mairal Incremental and Stochastic MM Algorithms 18/27

21 Incremental DC programming Consider a binary classification problem with n training samples (y_i, x_i), with y_i in {−1,+1} and x_i in R^p. Assume that there exists a sparse linear model y ≈ sign(θ^⊤x), learned by minimizing $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n}\log(1+e^{-y_i \theta^\top x_i}) + \lambda\psi(\theta)$. Traditional choices for ψ: $\psi(\theta)=\|\theta\|_2^2$ or $\|\theta\|_1$. Non-convex sparsity-inducing penalty: $\psi(\theta) = \sum_{j=1}^{p}\log(|\theta[j]|+\varepsilon)$. [Figure: the non-convex penalty ψ(θ) as a function of θ.] Julien Mairal Incremental and Stochastic MM Algorithms 19/27

22 Incremental DC programming Upper-bound $f_i: \theta \mapsto \log(1+e^{-y_i\theta^\top x_i})$ by $\theta \mapsto f_i(\theta_{t-1}) + \nabla f_i(\theta_{t-1})^\top(\theta-\theta_{t-1}) + \frac{L}{2}\|\theta-\theta_{t-1}\|_2^2$; upper-bound $\lambda\sum_{j=1}^{p}\log(|\theta[j]|+\varepsilon)$ by $\theta \mapsto \lambda\sum_{j=1}^{p}\frac{|\theta[j]|}{|\theta_{t-1}[j]|+\varepsilon}$ (up to an additive constant). This is an incremental reweighted-ℓ1 algorithm [Candès et al., 2008]. The overall surrogate can be minimized in closed form by using soft-thresholding. Julien Mairal Incremental and Stochastic MM Algorithms 20/27
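A compact sketch of one such majorization-minimization step (batch form, synthetic data; the incremental variant described above would replace the full gradient by MISO-style running averages): the smooth loss is majorized by a quadratic, the log penalty by a weighted ℓ1 norm, and the resulting surrogate is minimized by soft-thresholding.

```python
import numpy as np

def soft_threshold(u, tau):
    # componentwise proximal operator of the weighted l1 norm sum_j tau_j |u_j|
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def dc_mm_step(theta, X, y, lam, eps, L):
    """One batch MM step for logistic loss + lam * sum_j log(|theta_j| + eps):
    quadratic upper bound on the loss at `theta`, weighted-l1 upper bound on the
    penalty at `theta`, and closed-form minimization by soft-thresholding."""
    n = X.shape[0]
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))       # sigma(-y_i x_i'theta)
    grad = -(X.T @ (y * s)) / n                     # gradient of the smooth loss
    weights = lam / (np.abs(theta) + eps)           # reweighted-l1 weights
    return soft_threshold(theta - grad / L, weights / L)

# Toy run on synthetic data (eps and lam are arbitrary illustrative values).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = np.sign(X[:, :3] @ np.ones(3) + 0.1 * rng.standard_normal(100))
L = np.linalg.eigvalsh(X.T @ X).max() / (4 * 100)   # upper bound on the loss Hessian
theta = np.zeros(20)
for _ in range(200):
    theta = dc_mm_step(theta, X, y, lam=0.1, eps=1.0, L=L)
print(np.nonzero(theta)[0])                         # support of the sparse estimate
```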

23 Incremental Optimization: MISO [Figure: objective function value versus effective passes over the data on the alpha, rcv1, ocr, and webspam datasets, comparing MM, MM-LS, and MISO variants (MISO0, MISO1).] Julien Mairal Incremental and Stochastic MM Algorithms 21/27

24 Stochastic Majorization Minimization: SMM Suppose that f is an expectation: $f(\theta) = \mathbb{E}_x[\ell(\theta,x)]$. Recipe: draw a function $f_t: \theta \mapsto \ell(\theta,x_t)$ at iteration t; iteratively update an approximate surrogate $\bar{g}_t = (1-w_t)\bar{g}_{t-1} + w_t g_t$; choose appropriate w_t. Related Work: online-EM [Neal and Hinton, 1998, Cappé and Moulines, 2009]; online dictionary learning [Mairal et al., 2010a]. Julien Mairal Incremental and Stochastic MM Algorithms 22/27

25 Stochastic Majorization Minimization: SMM
Algorithm 3 Stochastic Majorization-Minimization Scheme
1: Input: θ_0 ∈ Θ (initial estimate); T (number of iterations); (w_t)_{t≥1}, weights in (0,1];
2: initialize the approximate surrogate: $\bar{g}_0: \theta \mapsto \frac{\rho}{2}\|\theta-\theta_0\|_2^2$;
3: for t = 1,...,T do
4: draw a training point x_t;
5: choose a surrogate function g_t of $f_t: \theta \mapsto \ell(x_t,\theta)$ near θ_{t−1};
6: update the approximate surrogate: $\bar{g}_t = (1-w_t)\bar{g}_{t-1} + w_t g_t$;
7: update the current estimate: $\theta_t \in \arg\min_{\theta \in \Theta} \bar{g}_t(\theta)$;
8: end for
9: Output: θ_T (current estimate).
Julien Mairal Incremental and Stochastic MM Algorithms 23/27
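A minimal Python sketch of this scheme with proximal gradient surrogates and ψ = λ‖·‖₁ (an assumed setting, with weights satisfying w₁ = 1 so the initial surrogate ḡ_0 drops out after the first step): each surrogate contributes a linear term plus (L/2)‖θ‖² plus ψ, so minimizing the averaged surrogate reduces to soft-thresholding a running average of linear coefficients; the streaming least-squares data at the end are synthetic.

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def smm_prox_l1(sample_grad, theta0, L, lam, T, weight):
    """Sketch of SMM with proximal gradient surrogates and psi = lam*||.||_1,
    assuming weight(1) = 1 so the initial surrogate drops out after one step.
    Each surrogate g_t equals, up to a constant,
    (L/2)||theta||^2 + <grad f_t(theta_{t-1}) - L*theta_{t-1}, theta> + psi(theta),
    so the averaged surrogate is minimized by soft-thresholding the running
    average z_t of the linear coefficients."""
    theta, z = theta0.copy(), None
    for t in range(1, T + 1):
        w = weight(t)
        new = sample_grad(theta) - L * theta            # linear coefficient of g_t
        z = new if z is None else (1 - w) * z + w * new
        theta = soft_threshold(-z / L, lam / L)         # argmin of the averaged surrogate
    return theta

# Toy usage: streaming least-squares loss l(theta, (a, b)) = 0.5 (a'theta - b)^2.
rng = np.random.default_rng(0)
true_theta = np.zeros(20)
true_theta[:3] = 1.0
def sample_grad(theta):
    a = rng.standard_normal(20)
    b = a @ true_theta + 0.1 * rng.standard_normal()
    return a * (a @ theta - b)
theta = smm_prox_l1(sample_grad, np.zeros(20), L=40.0, lam=0.1, T=5000,
                    weight=lambda t: 1.0 / np.sqrt(t))
print(np.round(theta, 2))
```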

26 Stochastic Majorization Minimization: SMM Update Rule for Proximal Gradient Surrogates: $\theta_t \in \arg\min_{\theta \in \Theta} \sum_{i=1}^{t} w_t^i\left[\nabla f_i(\theta_{i-1})^\top\theta + \frac{L}{2}\|\theta-\theta_{i-1}\|_2^2 + \psi(\theta)\right]$. (SMM) Other schemes in the literature [Duchi and Singer, 2009]: $\theta_t \in \arg\min_{\theta \in \Theta} \nabla f_t(\theta_{t-1})^\top\theta + \frac{1}{2\eta_t}\|\theta-\theta_{t-1}\|_2^2 + \psi(\theta)$, (FOBOS) or regularized dual averaging (RDA) of Xiao [2010]: $\theta_t \in \arg\min_{\theta \in \Theta} \frac{1}{t}\sum_{i=1}^{t}\nabla f_i(\theta_{i-1})^\top\theta + \frac{1}{2\eta_t}\|\theta\|_2^2 + \psi(\theta)$, (RDA) or others... Julien Mairal Incremental and Stochastic MM Algorithms 24/27

27 Stochastic Majorization Minimization: SMM Theoretical Guarantees - Non-Convex Problems: under a set of reasonable assumptions, f(θ_t) almost surely converges; the function ḡ_t asymptotically behaves as a first-order surrogate; we almost surely have asymptotic stationary point conditions. Theoretical Guarantees - Convex Problems: for proximal gradient surrogates, we obtain similar expected rates as SGD with averaging [see Nemirovski et al., 2009]: O(1/t) for strongly convex problems, $O(\log(t)/\sqrt{t})$ for convex ones (under bounded subgradient assumptions and specific w_t). Julien Mairal Incremental and Stochastic MM Algorithms 25/27

28 Experimental Conclusions for ℓ2-logistic Regression Incremental and stochastic schemes were significantly faster than batch ones; MISO with heuristics was competitive with the state of the art (SAG, SGD, Liblinear); after one pass over the data, SMM quickly achieves a low-precision solution. For higher precision, MISO is preferred. The problems tested were large but relatively well conditioned. Julien Mairal Incremental and Stochastic MM Algorithms 26/27

29 Conclusion What we have done we have given a unified view of a large number of algorithms;... and introduced new ones for large-scale optimization. A take-home message our algorithms are likely to be useful for large-scale non-convex and possibly non-smooth problems, which is a relatively non-standard, but useful, setting. Source Code code is now available in the toolbox SPAMS (C++ interfaced with Matlab, Python, R). Julien Mairal Incremental and Stochastic MM Algorithms 27/27

30 Examples of First-Order Surrogate Functions More Exotic Surrogates: Consider a smooth approximation of the trace (nuclear) norm (see François Caron's talk): $f_\mu: \theta \mapsto \mathrm{Tr}\big((\theta^\top\theta+\mu I)^{1/2}\big) = \sum_{i=1}^{p}\sqrt{\lambda_i(\theta^\top\theta)+\mu}$. The function $\tilde{f}: H \mapsto \mathrm{Tr}(H^{1/2})$ is concave on the set of p.d. matrices and $\nabla\tilde{f}(H) = \frac{1}{2}H^{-1/2}$, which gives the surrogate $g_\mu: \theta \mapsto f_\mu(\kappa) + \frac{1}{2}\mathrm{Tr}\big((\kappa^\top\kappa+\mu I)^{-1/2}(\theta^\top\theta-\kappa^\top\kappa)\big)$ and yields the algorithm of Mohan and Fazel [2012]. And also variational, saddle-point, Jensen surrogates... Julien Mairal Incremental and Stochastic MM Algorithms 28/27

31 Examples of First-Order Surrogate Functions Variational Surrogates: $f(\theta_1) = \min_{\theta_2 \in \Theta_2} \tilde{f}(\theta_1,\theta_2)$, where $\tilde{f}$ is smooth w.r.t. θ_1 and strongly convex w.r.t. θ_2: $g: \theta_1 \mapsto \tilde{f}(\theta_1,\kappa_2^\star)$ with $\kappa_2^\star = \arg\min_{\theta_2 \in \Theta_2}\tilde{f}(\kappa_1,\theta_2)$. Saddle-Point Surrogates: $f(\theta_1) = \max_{\theta_2 \in \Theta_2} \tilde{f}(\theta_1,\theta_2)$, where $\tilde{f}$ is smooth w.r.t. θ_1 and strongly concave w.r.t. θ_2: $g: \theta_1 \mapsto \tilde{f}(\theta_1,\kappa_2^\star) + \frac{L}{2}\|\theta_1-\kappa_1\|_2^2$. Jensen Surrogates: $f(\theta) = \tilde{f}(x^\top\theta)$, where $\tilde{f}$ is L-smooth. Choose a weight vector w in $\mathbb{R}_+^p$ such that $\|w\|_1 = 1$ and $w_i \neq 0$ whenever $x_i \neq 0$: $g: \theta \mapsto \sum_{i=1}^{p} w_i\, \tilde{f}\!\left(\frac{x_i}{w_i}(\theta_i-\kappa_i) + x^\top\kappa\right)$. Julien Mairal Incremental and Stochastic MM Algorithms 29/27

32 Stochastic DC programming For logistic regression with a non-convex sparsity-inducing penalty. [Figure: objective on the training and test sets versus iterations/epochs on the rcv1 and webspam datasets, comparing Online DC and Batch DC.] Julien Mairal Incremental and Stochastic MM Algorithms 30/27

33 Other variants of MM We also study in [Mairal, 2013a] a block coordinate scheme for non-convex and convex optimization, as well as several variants for convex optimization: an accelerated one (Nesterov-like); a Frank-Wolfe majorization-minimization algorithm. Julien Mairal Incremental and Stochastic MM Algorithms 31/27

34 Online Dictionary Learning Experimental results, batch vs online, m = 8×8, k = 256. Julien Mairal Incremental and Stochastic MM Algorithms 32/27

35 Online Dictionary Learning Experimental results, batch vs online, m = …, k = 512. Julien Mairal Incremental and Stochastic MM Algorithms 33/27

36 Online Dictionary Learning Experimental results, batch vs online, m = 16×16, k = 1024. Julien Mairal Incremental and Stochastic MM Algorithms 34/27

37 Online Structured Matrix Factorization With a structured regularization function ϕ [Jenatton et al., 2009]: $\varphi(D) = \gamma_1 \sum_{j=1}^{K}\sum_{g \in \mathcal{G}} \max_{k \in g} |d_j[k]| + \gamma_2\|D\|_F^2$. The proximal operator of ϕ can be computed by using network flow optimization [Mairal et al., 2010b]. Figure: Left: subset of a larger dictionary obtained with ℓ1; Right: subset obtained with ϕ after initialization with the dictionary on the left. About 20 minutes per pass on the data on the 1.2GHz laptop CPU. Julien Mairal Incremental and Stochastic MM Algorithms 35/27

38 Online Sparse Matrix Factorization Consider some signals x in R^m. We want to find a dictionary D in R^{m×K}. The quality of D is measured through the loss $\ell(x, D) = \min_{\alpha \in \mathbb{R}^K} \frac{1}{2}\|x-D\alpha\|_2^2 + \lambda_1\|\alpha\|_1 + \frac{\lambda_2}{2}\|\alpha\|_2^2$. Then, learning the dictionary amounts to solving $\min_{D \in \mathcal{C}} \mathbb{E}_x[\ell(x,D)] + \varphi(D)$. Why is it a matrix factorization problem? $\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{K \times n}} \frac{1}{n}\left[\frac{1}{2}\|X-DA\|_F^2 + \sum_{i=1}^{n}\left(\lambda_1\|\alpha_i\|_1 + \frac{\lambda_2}{2}\|\alpha_i\|_2^2\right)\right] + \varphi(D)$. Julien Mairal Incremental and Stochastic MM Algorithms 36/27
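The inner problem defining ℓ(x, D) is an elastic-net regularized least-squares problem; below is a minimal ISTA-based sketch (synthetic dictionary and signal), whereas dedicated Lasso solvers are used for this step in practice.

```python
import numpy as np

def elastic_net_code(x, D, lam1, lam2, n_iter=200):
    """Sketch of the sparse coding step: solve
    min_alpha 0.5*||x - D alpha||^2 + lam1*||alpha||_1 + 0.5*lam2*||alpha||^2
    with plain ISTA (dedicated Lasso solvers are used in practice)."""
    K = D.shape[1]
    L = np.linalg.eigvalsh(D.T @ D).max() + lam2       # Lipschitz constant of the smooth part
    alpha = np.zeros(K)
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x) + lam2 * alpha
        u = alpha - grad / L
        alpha = np.sign(u) * np.maximum(np.abs(u) - lam1 / L, 0.0)
    return alpha

# Toy usage with a random dictionary of K = 10 unit-norm atoms in dimension m = 8.
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 10))
D /= np.linalg.norm(D, axis=0)                         # columns in the constraint set C
alpha = elastic_net_code(rng.standard_normal(8), D, lam1=0.2, lam2=0.01)
print(alpha)
```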

39 Online Structured Matrix Factorization When $\mathcal{C} = \{D \in \mathbb{R}^{m \times K} \text{ s.t. } \|d_j\|_2 \leq 1\}$ and ϕ = 0, the problem is called sparse coding or dictionary learning [Olshausen and Field, 1997, Elad and Aharon, 2006, Mairal et al., 2010a]. Non-negativity constraints can be easily added; it yields an online nonnegative matrix factorization algorithm. ϕ can be a function encouraging a particular structure in D [Jenatton et al., 2009]. Julien Mairal Incremental and Stochastic MM Algorithms 37/27

40 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. 0s on an old laptop 1.2GHz dual-core CPU (initialization). Julien Mairal Incremental and Stochastic MM Algorithms 38/27

41 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. 1.15s on an old laptop 1.2GHz dual-core CPU (0.1 pass). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

42 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. 5.97s on an old laptop 1.2GHz dual-core CPU (0.5 pass). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

43 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. …s on an old laptop 1.2GHz dual-core CPU (1 pass). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

44 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. …s on an old laptop 1.2GHz dual-core CPU (2 passes). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

45 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. …s on an old laptop 1.2GHz dual-core CPU (5 passes). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

46 References I Y. F. Atchadé, G. Fort, and E. Moulines. On stochastic proximal gradient algorithms. arXiv preprint, 2014. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 2009. D. Böhning and B. G. Lindsay. Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4), 1988. E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14, 2008. O. Cappé and E. Moulines. On-line expectation maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B, 71(3), 2009. Julien Mairal Incremental and Stochastic MM Algorithms 29/27

47 References II P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer. J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10, 2009. M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), December 2006. G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with non-convex penalties and DC programming. IEEE Transactions on Signal Processing, 57(12), 2009. T. Jebara and A. Choromanska. Majorization for CRFs and latent likelihoods. In Advances in Neural Information Processing Systems, 2012. Julien Mairal Incremental and Stochastic MM Algorithms 30/27

48 References III R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv preprint, 2009. E. Khan, B. Marlin, G. Bouchard, and K. Murphy. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing Systems, 2010. K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1-20, 2000. J. Mairal. Optimization with first-order surrogate functions. In Proceedings of the International Conference on Machine Learning (ICML), 2013a. J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013b. Julien Mairal Incremental and Stochastic MM Algorithms 31/27

49 References IV J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. arXiv preprint, 2014. J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 2010a. J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems, 2010b. K. Mohan and M. Fazel. Iterative reweighted algorithms for matrix rank minimization. Journal of Machine Learning Research, 13, 2012. R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89, 1998. Julien Mairal Incremental and Stochastic MM Algorithms 32/27

50 References V A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 2009. Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007. B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1997. M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013. S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint, 2012. S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing. Julien Mairal Incremental and Stochastic MM Algorithms 33/27

51 References VI L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11, 2010. Julien Mairal Incremental and Stochastic MM Algorithms 34/27
