Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning


1 Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic MM Algorithms 1/27

2 Statistical modeling with regularized risk minimization Given some data points x_i, i = 1,...,n, learn some model parameters θ in R^p by minimizing $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} \ell(x_i,\theta) + \lambda\psi(\theta)$, where ℓ measures the data fit, and ψ is a regularizer. The goal of this work is to deal with large n for relatively non-standard settings (non-convex, non-smooth, stochastic). Julien Mairal Incremental and Stochastic MM Algorithms 2/27

3 A simple (naive) optimization principle [Figure: a surrogate g(θ) majorizing f(θ), with equality at the current point κ.] Objective: $\min_{\theta \in \Theta} f(\theta)$. Principle called Majorization-Minimization [Lange et al., 2000]; quite popular in statistics and signal processing. Julien Mairal Incremental and Stochastic MM Algorithms 3/27

4 In this work [Figure: a surrogate g(θ) majorizing f(θ), with equality at the current point κ.] scalable Majorization-Minimization algorithms; for convex or non-convex and smooth or non-smooth problems. References: J. Mairal. Optimization with First-Order Surrogate Functions. ICML 13; J. Mairal. Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization. NIPS 13. Julien Mairal Incremental and Stochastic MM Algorithms 4/27

5 In this work Methodology: extend the MM principle to a large variety of settings; compute convergence rates for convex problems; show stationary point conditions for non-convex ones. First direction, incremental optimization: minimizes $\frac{1}{n}\sum_{i=1}^{n} f_i(\theta)$; requires some memory about past iterates; fast convergence rate for several passes over the data. Second direction, stochastic optimization: no memory about past iterates; minimizes $\mathbb{E}_x[f(\theta,x)]$. Julien Mairal Incremental and Stochastic MM Algorithms 5/27

6 Related work Incremental approaches for convex optimization: stochastic average gradient [Schmidt, Roux, and Bach, 2013]; stochastic dual coordinate ascent [Shalev-Shwartz and Zhang, 2012]. Stochastic optimization: stochastic proximal methods, e.g., [Duchi and Singer, 2009, Atchade et al., 2014]; literature about stochastic gradient descent, see, e.g., [Nemirovski et al., 2009]. Non-convex optimization: DC programming, see, e.g., [Gasso et al., 2009]; online EM [Neal and Hinton, 1998, Cappé and Moulines, 2009]. Julien Mairal Incremental and Stochastic MM Algorithms 6/27

7 Setting: First-Order Surrogate Functions [Figure: a surrogate g(θ) majorizing f(θ) near κ, with approximation error h(θ).] g(θ') ≥ f(θ') for all θ' in $\arg\min_{\theta \in \Theta} g(\theta)$; the approximation error h = g − f is differentiable, and ∇h is L-Lipschitz. Moreover, h(κ) = 0 and ∇h(κ) = 0; we sometimes assume g to be strongly convex. Julien Mairal Incremental and Stochastic MM Algorithms 7/27
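To make these conditions concrete, here is a small numerical check (not part of the original slides) for the one-dimensional logistic loss f(θ) = log(1 + e^{−θ}) and the quadratic surrogate g(θ) = f(κ) + f'(κ)(θ − κ) + (L/2)(θ − κ)² with L = 1/4, a valid bound since f'' ≤ 1/4; the point κ, the grid, and the tolerances are arbitrary choices.

```python
import numpy as np

# Numerical check of the first-order surrogate conditions for the 1-D logistic
# loss f(theta) = log(1 + exp(-theta)), whose second derivative is at most 1/4,
# and the quadratic surrogate g built at an (arbitrary) point kappa.
f = lambda t: np.log1p(np.exp(-t))
f_prime = lambda t: -1.0 / (1.0 + np.exp(t))
L = 0.25                                       # valid Lipschitz constant for f'

kappa = 0.7                                    # assumed current iterate
g = lambda t: f(kappa) + f_prime(kappa) * (t - kappa) + 0.5 * L * (t - kappa) ** 2
h = lambda t: g(t) - f(t)                      # approximation error h = g - f

grid = np.linspace(-10.0, 10.0, 2001)
assert np.all(h(grid) >= -1e-12)               # g majorizes f (so g >= f at its minimizer)
assert abs(h(kappa)) < 1e-12                   # h(kappa) = 0
eps = 1e-6
assert abs(h(kappa + eps) - h(kappa - eps)) / (2 * eps) < 1e-6   # grad h(kappa) ~ 0
print("surrogate conditions hold at kappa =", kappa)
```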

8 The Basic MM Algorithm
Algorithm 1 Basic Majorization-Minimization Scheme
1: Input: θ_0 ∈ Θ (initial estimate); T (number of iterations).
2: for t = 1,...,T do
3: Compute a surrogate g_t of f near θ_{t−1};
4: Minimize g_t and update the solution: $\theta_t \in \arg\min_{\theta \in \Theta} g_t(\theta)$.
5: end for
6: Output: θ_T (final estimate).
Julien Mairal Incremental and Stochastic MM Algorithms 8/27
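A minimal Python sketch of this scheme (an illustration, not the author's implementation): the caller supplies the routine that minimizes the surrogate built at the current iterate, and the toy quadratic at the end is assumed data showing that, with the Lipschitz gradient surrogate of the next slide, the loop reduces to gradient descent with step 1/L.

```python
import numpy as np

def mm(theta0, surrogate_argmin, T):
    """Basic majorization-minimization loop: repeatedly minimize a surrogate
    built near the current iterate.  `surrogate_argmin(kappa)` is assumed to
    return argmin over Theta of a first-order surrogate of f built near kappa."""
    theta = theta0
    for _ in range(T):
        theta = surrogate_argmin(theta)        # steps 3-4 of Algorithm 1
    return theta

# With the Lipschitz gradient surrogate, the surrogate minimizer is a gradient
# step, so MM reduces to gradient descent with step size 1/L
# (toy strongly convex quadratic, assumed for illustration only).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda th: A @ th - b                 # f(theta) = 0.5 theta'A theta - b'theta
L = np.linalg.eigvalsh(A).max()                # Lipschitz constant of grad f
theta_hat = mm(np.zeros(2), lambda kappa: kappa - grad_f(kappa) / L, T=200)
print(np.allclose(theta_hat, np.linalg.solve(A, b)))   # True
```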

9 Examples of First-Order Surrogate Functions Lipschitz Gradient Surrogates: f is L-smooth (differentiable + L-Lipschitz gradient). $g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top(\theta-\kappa) + \frac{L}{2}\|\theta-\kappa\|_2^2$. Minimizing g yields a gradient descent step $\theta \leftarrow \kappa - \frac{1}{L}\nabla f(\kappa)$. Julien Mairal Incremental and Stochastic MM Algorithms 9/27

10 Examples of First-Order Surrogate Functions Lipschitz Gradient Surrogates: f is L-smooth (differentiable + L-Lipschitz gradient). $g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top(\theta-\kappa) + \frac{L}{2}\|\theta-\kappa\|_2^2$. Minimizing g yields a gradient descent step $\theta \leftarrow \kappa - \frac{1}{L}\nabla f(\kappa)$. Proximal Gradient Surrogates: f = f_1 + f_2 with f_1 smooth. $g: \theta \mapsto f_1(\kappa) + \nabla f_1(\kappa)^\top(\theta-\kappa) + \frac{L}{2}\|\theta-\kappa\|_2^2 + f_2(\theta)$. Minimizing g amounts to one step of the forward-backward, ISTA, or proximal gradient descent algorithm [Beck and Teboulle, 2009, Combettes and Pesquet, 2010, Wright et al., 2008, Nesterov, 2007]. Julien Mairal Incremental and Stochastic MM Algorithms 9/27
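A short sketch of that step (assumed example, not from the slides): with f₂ = λ‖·‖₁, minimizing the proximal gradient surrogate amounts to a gradient step on f₁ followed by soft-thresholding, i.e. one ISTA iteration; the least-squares data below are synthetic.

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista_step(theta, grad_f1, L, lam):
    """Minimize the proximal gradient surrogate built at `theta` for
    f = f1 + lam*||.||_1: a gradient step on f1 followed by soft-thresholding."""
    return soft_threshold(theta - grad_f1(theta) / L, lam / L)

# Toy lasso problem with f1(theta) = 0.5 ||X theta - y||_2^2 (synthetic data).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad_f1 = lambda th: X.T @ (X @ th - y)
L = np.linalg.eigvalsh(X.T @ X).max()          # Lipschitz constant of grad f1
theta = np.zeros(5)
for _ in range(500):
    theta = ista_step(theta, grad_f1, L, lam=1.0)
print(theta)                                   # sparse estimate
```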

11 Examples of First-Order Surrogate Functions Linearizing Concave Functions and DC-Programming: f = f_1 + f_2 with f_2 smooth and concave. $g: \theta \mapsto f_1(\theta) + f_2(\kappa) + \nabla f_2(\kappa)^\top(\theta-\kappa)$. When f_1 is convex, the algorithm is called DC-programming. Julien Mairal Incremental and Stochastic MM Algorithms 10/27

12 Examples of First-Order Surrogate Functions Linearizing Concave Functions and DC-Programming: f = f_1 + f_2 with f_2 smooth and concave. $g: \theta \mapsto f_1(\theta) + f_2(\kappa) + \nabla f_2(\kappa)^\top(\theta-\kappa)$. When f_1 is convex, the algorithm is called DC-programming. Quadratic Surrogates: f is twice differentiable, and H is a uniform upper bound of $\nabla^2 f$: $g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top(\theta-\kappa) + \frac{1}{2}(\theta-\kappa)^\top H (\theta-\kappa)$. Actually a big deal in statistics and machine learning [Böhning and Lindsay, 1988, Khan et al., 2010, Jebara and Choromanska, 2012]. Julien Mairal Incremental and Stochastic MM Algorithms 10/27
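For instance, for ℓ2-logistic regression the classical Böhning-Lindsay bound $\nabla^2 f \preceq H = \frac{1}{4n}X^\top X + \lambda I$ gives such a quadratic surrogate; the sketch below (synthetic data, not from the talk) iterates the resulting MM update θ ← θ − H⁻¹∇f(θ) with H fixed and inverted once.

```python
import numpy as np

# Quadratic-surrogate MM for l2-regularized logistic regression, using the
# classical Boehning-Lindsay bound: Hessian <= H = (1/(4n)) X'X + lam*I.
# Synthetic data; H is fixed, so its inverse is computed once.
rng = np.random.default_rng(0)
n, p, lam = 200, 10, 0.1
X = rng.standard_normal((n, p))
y = np.sign(X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n))

def grad(theta):
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))  # sigma(-y_i x_i'theta)
    return -(X.T @ (y * s)) / n + lam * theta

H_inv = np.linalg.inv(X.T @ X / (4.0 * n) + lam * np.eye(p))

theta = np.zeros(p)
for _ in range(100):
    theta = theta - H_inv @ grad(theta)        # minimize the quadratic surrogate at theta
print(np.linalg.norm(grad(theta)))             # close to zero at the optimum
```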

13 Theoretical Guarantees When using first-order surrogates, for convex problems: $f(\theta_t) - f^\star = O(1/t)$; for µ-strongly convex ones: $O((1-\mu/L)^t)$; for non-convex problems: f(θ_t) monotonically decreases and $\liminf_{t \to +\infty} \inf_{\theta \in \Theta} \frac{\nabla f(\theta_t, \theta-\theta_t)}{\|\theta-\theta_t\|_2} \geq 0$, (1) which we call asymptotic stationary point condition. Directional derivative: $\nabla f(\theta,\kappa) = \lim_{\varepsilon \to 0^+} \frac{f(\theta+\varepsilon\kappa) - f(\theta)}{\varepsilon}$. When in addition Θ = R^p, (1) is equivalent to $\nabla f(\theta_t) \to 0$. Julien Mairal Incremental and Stochastic MM Algorithms 11/27

14 Incremental Optimization: MISO Suppose that f splits into many components: $f(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta)$. Recipe: incrementally update an approximate surrogate $\frac{1}{n}\sum_{i=1}^{n} g^i$; add some heuristics for practical implementations. Related work for convex problems: related to SAG [Schmidt et al., 2013] and SDCA [Shalev-Shwartz and Zhang, 2012], but offers different update rules. Julien Mairal Incremental and Stochastic MM Algorithms 12/27

15 Incremental Optimization: MISO
Algorithm 2 Incremental Scheme MISO
1: Input: θ_0 ∈ Θ; T (number of iterations).
2: Choose surrogates $g_0^i$ of $f_i$ near θ_0 for all i;
3: for t = 1,...,T do
4: Randomly pick one index $\hat{\imath}_t$ and choose a surrogate $g_t^{\hat{\imath}_t}$ of $f_{\hat{\imath}_t}$ near θ_{t−1}. Set $g_t^i = g_{t-1}^i$ for $i \neq \hat{\imath}_t$;
5: Update the solution: $\theta_t \in \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} g_t^i(\theta)$.
6: end for
7: Output: θ_T (final estimate).
Julien Mairal Incremental and Stochastic MM Algorithms 13/27

16 Incremental Optimization: MISO Update rule with Lipschitz gradient surrogates. We want to minimize $\frac{1}{n}\sum_{i=1}^{n} f_i(\theta)$. $\theta_t = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n}\left[ f_i(\kappa^i) + \nabla f_i(\kappa^i)^\top(\theta-\kappa^i) + \frac{L}{2}\|\theta-\kappa^i\|_2^2 \right] = \frac{1}{n}\sum_{i=1}^{n}\kappa^i - \frac{1}{Ln}\sum_{i=1}^{n}\nabla f_i(\kappa^i)$. At iteration t, randomly draw one index $\hat{\imath}_t$, and update $\kappa^{\hat{\imath}_t} \leftarrow \theta_t$. Remarks: replacing $\frac{1}{n}\sum_{i=1}^{n}\kappa^i$ by θ_{t−1} yields SAG [Schmidt et al., 2013]; replacing 1/L by 1/µ is almost identical to SDCA [Shalev-Shwartz and Zhang, 2012]. Julien Mairal Incremental and Stochastic MM Algorithms 14/27
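A Python sketch of this update (an illustration under assumptions, not the reference implementation): one anchor point κ_i is stored per function together with the two running averages appearing above, so each iteration costs O(p) plus one gradient evaluation; the least-squares components at the end are synthetic.

```python
import numpy as np

def miso_smooth(grads, theta0, L, T, rng):
    """Sketch of MISO with Lipschitz gradient surrogates for (1/n) sum_i f_i(theta):
    keep one anchor point kappa_i per function and the running averages entering
    theta_t = (1/n) sum_i kappa_i - (1/(L n)) sum_i grad f_i(kappa_i)."""
    n = len(grads)
    kappa = np.tile(theta0, (n, 1))                       # anchor points kappa_i
    g = np.array([grads[i](theta0) for i in range(n)])    # stored gradients
    kappa_bar, g_bar = kappa.mean(axis=0), g.mean(axis=0)
    theta = kappa_bar - g_bar / L
    for _ in range(T):
        i = rng.integers(n)                               # draw one index at random
        new_kappa, new_g = theta.copy(), grads[i](theta)
        kappa_bar = kappa_bar + (new_kappa - kappa[i]) / n   # O(p) update of the averages
        g_bar = g_bar + (new_g - g[i]) / n
        kappa[i], g[i] = new_kappa, new_g
        theta = kappa_bar - g_bar / L
    return theta

# Toy usage: f_i(theta) = 0.5 (a_i'theta - b_i)^2, so grad f_i is a_i (a_i'theta - b_i).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 5)), rng.standard_normal(50)
grads = [lambda th, a=A[i], bi=b[i]: a * (a @ th - bi) for i in range(50)]
L = np.max(np.sum(A ** 2, axis=1))                        # each grad f_i is ||a_i||^2-Lipschitz
theta = miso_smooth(grads, np.zeros(5), L, T=2000, rng=rng)
print(theta)
```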

17 Incremental Optimization: MISO Update rule for proximal gradient surrogates. We want to minimize $\frac{1}{n}\sum_{i=1}^{n} f_i(\theta) + \psi(\theta)$. $\theta_t = \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n}\left[ f_i(\kappa_t^i) + \nabla f_i(\kappa_t^i)^\top(\theta-\kappa_t^i) + \frac{L}{2}\|\theta-\kappa_t^i\|_2^2 \right] + \psi(\theta) = \arg\min_{\theta \in \Theta} \frac{1}{2}\left\|\theta - \left(\frac{1}{n}\sum_{i=1}^{n}\kappa_t^i - \frac{1}{Ln}\sum_{i=1}^{n}\nabla f_i(\kappa_t^i)\right)\right\|_2^2 + \frac{1}{L}\psi(\theta)$. Julien Mairal Incremental and Stochastic MM Algorithms 15/27

18 Incremental Optimization: MISO Theoretical Guarantees: for non-convex problems, the guarantees are the same as the generic MM algorithm, with probability one; for convex problems and proximal gradient surrogates, the expected convergence rate with averaging becomes O(n/t); for µ-strongly convex problems and proximal gradient surrogates, the expected convergence rate is linear, $O((1-\mu/(nL))^t)$. Remarks for µ-strongly convex problems: the rates of SDCA and SAG in this setting are better: µ/(nL) is replaced by O(min(µ/L, 1/n)); the MM principle is too conservative. For smooth problems, we can match these rates by using minorizing surrogates [Mairal, 2014]. Julien Mairal Incremental and Stochastic MM Algorithms 16/27

19 Incremental Optimization: MISO Example for ℓ2-logistic regression: $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n}\log(1+e^{-y_i \theta^\top x_i}) + \frac{\lambda}{2}\|\theta\|_2^2$. The problem is λ-strongly convex. Table: Description of datasets used in our experiments: alpha (dense), ocr (dense), rcv1 (sparse), webspam (sparse). Julien Mairal Incremental and Stochastic MM Algorithms 17/27

20 Incremental Optimization: MISO [Figure: relative duality gap versus effective passes over the data on the alpha, ocr, rcv1, and webspam datasets, comparing SGD, FISTA-LS, SDCA, SAG, and MISO variants (MISO0, MISO2, MISOµ).] Julien Mairal Incremental and Stochastic MM Algorithms 18/27

21 Incremental DC programming Consider a binary classification problem with n training samples (y_i, x_i), with y_i in {−1,+1} and x_i in R^p. Assume that there exists a sparse linear model y ≈ sign(θ^⊤x), learned by minimizing $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n}\log(1+e^{-y_i \theta^\top x_i}) + \lambda\psi(\theta)$. Traditional choices for ψ: $\psi(\theta)=\|\theta\|_2^2$ or $\|\theta\|_1$. Non-convex sparsity-inducing penalty: $\psi(\theta) = \sum_{j=1}^{p}\log(|\theta[j]|+\varepsilon)$. [Figure: the non-convex penalty ψ(θ) as a function of θ.] Julien Mairal Incremental and Stochastic MM Algorithms 19/27

22 Incremental DC programming Upper-bound $f_i: \theta \mapsto \log(1+e^{-y_i\theta^\top x_i})$ by $\theta \mapsto f_i(\theta_{t-1}) + \nabla f_i(\theta_{t-1})^\top(\theta-\theta_{t-1}) + \frac{L}{2}\|\theta-\theta_{t-1}\|_2^2$; upper-bound $\lambda\sum_{j=1}^{p}\log(|\theta[j]|+\varepsilon)$ by $\theta \mapsto \lambda\sum_{j=1}^{p}\frac{|\theta[j]|}{|\theta_{t-1}[j]|+\varepsilon}$ (up to an additive constant). This is an incremental reweighted-ℓ1 algorithm [Candès et al., 2008]. The overall surrogate can be minimized in closed form by using soft-thresholding. Julien Mairal Incremental and Stochastic MM Algorithms 20/27
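A compact sketch of one such majorization-minimization step (batch form, synthetic data; the incremental variant described above would replace the full gradient by MISO-style running averages): the smooth loss is majorized by a quadratic, the log penalty by a weighted ℓ1 norm, and the resulting surrogate is minimized by soft-thresholding.

```python
import numpy as np

def soft_threshold(u, tau):
    # componentwise proximal operator of the weighted l1 norm sum_j tau_j |u_j|
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def dc_mm_step(theta, X, y, lam, eps, L):
    """One batch MM step for logistic loss + lam * sum_j log(|theta_j| + eps):
    quadratic upper bound on the loss at `theta`, weighted-l1 upper bound on the
    penalty at `theta`, and closed-form minimization by soft-thresholding."""
    n = X.shape[0]
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))       # sigma(-y_i x_i'theta)
    grad = -(X.T @ (y * s)) / n                     # gradient of the smooth loss
    weights = lam / (np.abs(theta) + eps)           # reweighted-l1 weights
    return soft_threshold(theta - grad / L, weights / L)

# Toy run on synthetic data (eps and lam are arbitrary illustrative values).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = np.sign(X[:, :3] @ np.ones(3) + 0.1 * rng.standard_normal(100))
L = np.linalg.eigvalsh(X.T @ X).max() / (4 * 100)   # upper bound on the loss Hessian
theta = np.zeros(20)
for _ in range(200):
    theta = dc_mm_step(theta, X, y, lam=0.1, eps=1.0, L=L)
print(np.nonzero(theta)[0])                         # support of the sparse estimate
```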

23 Incremental Optimization: MISO [Figure: objective function value versus effective passes over the data on the alpha, rcv1, ocr, and webspam datasets, comparing MM, MM-LS, and MISO variants (MISO0, MISO1).] Julien Mairal Incremental and Stochastic MM Algorithms 21/27

24 Stochastic Majorization Minimization: SMM Suppose that f is an expectation: $f(\theta) = \mathbb{E}_x[\ell(\theta,x)]$. Recipe: draw a function $f_t: \theta \mapsto \ell(\theta,x_t)$ at iteration t; iteratively update an approximate surrogate $\bar{g}_t = (1-w_t)\bar{g}_{t-1} + w_t g_t$; choose appropriate w_t. Related Work: online-EM [Neal and Hinton, 1998, Cappé and Moulines, 2009]; online dictionary learning [Mairal et al., 2010a]. Julien Mairal Incremental and Stochastic MM Algorithms 22/27

25 Stochastic Majorization Minimization: SMM
Algorithm 3 Stochastic Majorization-Minimization Scheme
1: Input: θ_0 ∈ Θ (initial estimate); T (number of iterations); (w_t)_{t≥1}, weights in (0,1];
2: initialize the approximate surrogate: $\bar{g}_0: \theta \mapsto \frac{\rho}{2}\|\theta-\theta_0\|_2^2$;
3: for t = 1,...,T do
4: draw a training point x_t;
5: choose a surrogate function g_t of $f_t: \theta \mapsto \ell(x_t,\theta)$ near θ_{t−1};
6: update the approximate surrogate: $\bar{g}_t = (1-w_t)\bar{g}_{t-1} + w_t g_t$;
7: update the current estimate: $\theta_t \in \arg\min_{\theta \in \Theta} \bar{g}_t(\theta)$;
8: end for
9: Output: θ_T (current estimate).
Julien Mairal Incremental and Stochastic MM Algorithms 23/27
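A minimal Python sketch of this scheme with proximal gradient surrogates and ψ = λ‖·‖₁ (an assumed setting, with weights satisfying w₁ = 1 so the initial surrogate ḡ_0 drops out after the first step): each surrogate contributes a linear term plus (L/2)‖θ‖² plus ψ, so minimizing the averaged surrogate reduces to soft-thresholding a running average of linear coefficients; the streaming least-squares data at the end are synthetic.

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def smm_prox_l1(sample_grad, theta0, L, lam, T, weight):
    """Sketch of SMM with proximal gradient surrogates and psi = lam*||.||_1,
    assuming weight(1) = 1 so the initial surrogate drops out after one step.
    Each surrogate g_t equals, up to a constant,
    (L/2)||theta||^2 + <grad f_t(theta_{t-1}) - L*theta_{t-1}, theta> + psi(theta),
    so the averaged surrogate is minimized by soft-thresholding the running
    average z_t of the linear coefficients."""
    theta, z = theta0.copy(), None
    for t in range(1, T + 1):
        w = weight(t)
        new = sample_grad(theta) - L * theta            # linear coefficient of g_t
        z = new if z is None else (1 - w) * z + w * new
        theta = soft_threshold(-z / L, lam / L)         # argmin of the averaged surrogate
    return theta

# Toy usage: streaming least-squares loss l(theta, (a, b)) = 0.5 (a'theta - b)^2.
rng = np.random.default_rng(0)
true_theta = np.zeros(20)
true_theta[:3] = 1.0
def sample_grad(theta):
    a = rng.standard_normal(20)
    b = a @ true_theta + 0.1 * rng.standard_normal()
    return a * (a @ theta - b)
theta = smm_prox_l1(sample_grad, np.zeros(20), L=40.0, lam=0.1, T=5000,
                    weight=lambda t: 1.0 / np.sqrt(t))
print(np.round(theta, 2))
```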

26 Stochastic Majorization Minimization: SMM Update Rule for Proximal Gradient Surrogates: $\theta_t \in \arg\min_{\theta \in \Theta} \sum_{i=1}^{t} w_t^i\left[\nabla f_i(\theta_{i-1})^\top\theta + \frac{L}{2}\|\theta-\theta_{i-1}\|_2^2 + \psi(\theta)\right]$. (SMM) Other schemes in the literature [Duchi and Singer, 2009]: $\theta_t \in \arg\min_{\theta \in \Theta} \nabla f_t(\theta_{t-1})^\top\theta + \frac{1}{2\eta_t}\|\theta-\theta_{t-1}\|_2^2 + \psi(\theta)$, (FOBOS) or regularized dual averaging (RDA) of Xiao [2010]: $\theta_t \in \arg\min_{\theta \in \Theta} \frac{1}{t}\sum_{i=1}^{t}\nabla f_i(\theta_{i-1})^\top\theta + \frac{1}{2\eta_t}\|\theta\|_2^2 + \psi(\theta)$, (RDA) or others... Julien Mairal Incremental and Stochastic MM Algorithms 24/27

27 Stochastic Majorization Minimization: SMM Theoretical Guarantees - Non-Convex Problems: under a set of reasonable assumptions, f(θ_t) almost surely converges; the function ḡ_t asymptotically behaves as a first-order surrogate; we almost surely have asymptotic stationary point conditions. Theoretical Guarantees - Convex Problems: for proximal gradient surrogates, we obtain similar expected rates as SGD with averaging [see Nemirovski et al., 2009]: O(1/t) for strongly convex problems, $O(\log(t)/\sqrt{t})$ for convex ones (under bounded subgradient assumptions and specific w_t). Julien Mairal Incremental and Stochastic MM Algorithms 25/27

28 Experimental Conclusions for ℓ2-logistic Regression Incremental and stochastic schemes were significantly faster than batch ones; MISO with heuristics was competitive with the state of the art (SAG, SGD, Liblinear); after one pass over the data, SMM quickly achieves a low-precision solution. For higher precision, MISO is preferred. The problems tested were large but relatively well conditioned. Julien Mairal Incremental and Stochastic MM Algorithms 26/27

29 Conclusion What we have done we have given a unified view of a large number of algorithms;... and introduced new ones for large-scale optimization. A take-home message our algorithms are likely to be useful for large-scale non-convex and possibly non-smooth problems, which is a relatively non-standard, but useful, setting. Source Code code is now available in the toolbox SPAMS (C++ interfaced with Matlab, Python, R). Julien Mairal Incremental and Stochastic MM Algorithms 27/27

30 Examples of First-Order Surrogate Functions More Exotic Surrogates: Consider a smooth approximation of the trace (nuclear) norm (see François Caron's talk): $f_\mu: \theta \mapsto \mathrm{Tr}\big((\theta^\top\theta+\mu I)^{1/2}\big) = \sum_{i=1}^{p}\sqrt{\lambda_i(\theta^\top\theta)+\mu}$. The function $\tilde{f}: H \mapsto \mathrm{Tr}(H^{1/2})$ is concave on the set of p.d. matrices and $\nabla\tilde{f}(H) = \frac{1}{2}H^{-1/2}$, which gives the surrogate $g_\mu: \theta \mapsto f_\mu(\kappa) + \frac{1}{2}\mathrm{Tr}\big((\kappa^\top\kappa+\mu I)^{-1/2}(\theta^\top\theta-\kappa^\top\kappa)\big)$ and yields the algorithm of Mohan and Fazel [2012]. And also variational, saddle-point, Jensen surrogates... Julien Mairal Incremental and Stochastic MM Algorithms 28/27

31 Examples of First-Order Surrogate Functions Variational Surrogates: $f(\theta_1) = \min_{\theta_2 \in \Theta_2} \tilde{f}(\theta_1,\theta_2)$, where $\tilde{f}$ is smooth w.r.t. θ_1 and strongly convex w.r.t. θ_2: $g: \theta_1 \mapsto \tilde{f}(\theta_1,\kappa_2^\star)$ with $\kappa_2^\star = \arg\min_{\theta_2 \in \Theta_2}\tilde{f}(\kappa_1,\theta_2)$. Saddle-Point Surrogates: $f(\theta_1) = \max_{\theta_2 \in \Theta_2} \tilde{f}(\theta_1,\theta_2)$, where $\tilde{f}$ is smooth w.r.t. θ_1 and strongly concave w.r.t. θ_2: $g: \theta_1 \mapsto \tilde{f}(\theta_1,\kappa_2^\star) + \frac{L}{2}\|\theta_1-\kappa_1\|_2^2$. Jensen Surrogates: $f(\theta) = \tilde{f}(x^\top\theta)$, where $\tilde{f}$ is L-smooth. Choose a weight vector w in $\mathbb{R}_+^p$ such that $\|w\|_1 = 1$ and $w_i \neq 0$ whenever $x_i \neq 0$: $g: \theta \mapsto \sum_{i=1}^{p} w_i\, \tilde{f}\!\left(\frac{x_i}{w_i}(\theta_i-\kappa_i) + x^\top\kappa\right)$. Julien Mairal Incremental and Stochastic MM Algorithms 29/27

32 Stochastic DC programming For logistic regression with a non-convex sparsity-inducing penalty. [Figure: objective on the training and test sets versus iterations/epochs on the rcv1 and webspam datasets, comparing Online DC and Batch DC.] Julien Mairal Incremental and Stochastic MM Algorithms 30/27

33 Other variants of MM We also study in [Mairal, 2013a] a block coordinate scheme for non-convex and convex optimization, as well as several variants for convex optimization: an accelerated one (Nesterov-like); a Frank-Wolfe majorization-minimization algorithm. Julien Mairal Incremental and Stochastic MM Algorithms 31/27

34 Online Dictionary Learning Experimental results, batch vs online, m = 8×8, k = 256. Julien Mairal Incremental and Stochastic MM Algorithms 32/27

35 Online Dictionary Learning Experimental results, batch vs online, m = …, k = 512. Julien Mairal Incremental and Stochastic MM Algorithms 33/27

36 Online Dictionary Learning Experimental results, batch vs online, m = 16×16, k = 1024. Julien Mairal Incremental and Stochastic MM Algorithms 34/27

37 Online Structured Matrix Factorization With a structured regularization function ϕ [Jenatton et al., 2009]: $\varphi(D) = \gamma_1 \sum_{j=1}^{K}\sum_{g \in \mathcal{G}} \max_{k \in g} |d_j[k]| + \gamma_2\|D\|_F^2$. The proximal operator of ϕ can be computed by using network flow optimization [Mairal et al., 2010b]. Figure: Left: subset of a larger dictionary obtained with ℓ1; Right: subset obtained with ϕ after initialization with the dictionary on the left. About 20 minutes per pass on the data on the 1.2GHz laptop CPU. Julien Mairal Incremental and Stochastic MM Algorithms 35/27

38 Online Sparse Matrix Factorization Consider some signals x in R^m. We want to find a dictionary D in R^{m×K}. The quality of D is measured through the loss $\ell(x, D) = \min_{\alpha \in \mathbb{R}^K} \frac{1}{2}\|x-D\alpha\|_2^2 + \lambda_1\|\alpha\|_1 + \frac{\lambda_2}{2}\|\alpha\|_2^2$. Then, learning the dictionary amounts to solving $\min_{D \in \mathcal{C}} \mathbb{E}_x[\ell(x,D)] + \varphi(D)$. Why is it a matrix factorization problem? $\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{K \times n}} \frac{1}{n}\left[\frac{1}{2}\|X-DA\|_F^2 + \sum_{i=1}^{n}\left(\lambda_1\|\alpha_i\|_1 + \frac{\lambda_2}{2}\|\alpha_i\|_2^2\right)\right] + \varphi(D)$. Julien Mairal Incremental and Stochastic MM Algorithms 36/27
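The inner problem defining ℓ(x, D) is an elastic-net regularized least-squares problem; below is a minimal ISTA-based sketch (synthetic dictionary and signal), whereas dedicated Lasso solvers are used for this step in practice.

```python
import numpy as np

def elastic_net_code(x, D, lam1, lam2, n_iter=200):
    """Sketch of the sparse coding step: solve
    min_alpha 0.5*||x - D alpha||^2 + lam1*||alpha||_1 + 0.5*lam2*||alpha||^2
    with plain ISTA (dedicated Lasso solvers are used in practice)."""
    K = D.shape[1]
    L = np.linalg.eigvalsh(D.T @ D).max() + lam2       # Lipschitz constant of the smooth part
    alpha = np.zeros(K)
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x) + lam2 * alpha
        u = alpha - grad / L
        alpha = np.sign(u) * np.maximum(np.abs(u) - lam1 / L, 0.0)
    return alpha

# Toy usage with a random dictionary of K = 10 unit-norm atoms in dimension m = 8.
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 10))
D /= np.linalg.norm(D, axis=0)                         # columns in the constraint set C
alpha = elastic_net_code(rng.standard_normal(8), D, lam1=0.2, lam2=0.01)
print(alpha)
```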

39 Online Structured Matrix Factorization When $\mathcal{C} = \{D \in \mathbb{R}^{m \times K} \text{ s.t. } \|d_j\|_2 \leq 1\}$ and ϕ = 0, the problem is called sparse coding or dictionary learning [Olshausen and Field, 1997, Elad and Aharon, 2006, Mairal et al., 2010a]. Non-negativity constraints can be easily added; it yields an online nonnegative matrix factorization algorithm. ϕ can be a function encouraging a particular structure in D [Jenatton et al., 2009]. Julien Mairal Incremental and Stochastic MM Algorithms 37/27

40 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. 0s on an old laptop 1.2GHz dual-core CPU (initialization). Julien Mairal Incremental and Stochastic MM Algorithms 38/27

41 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. 1.15s on an old laptop 1.2GHz dual-core CPU (0.1 pass). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

42 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. 5.97s on an old laptop 1.2GHz dual-core CPU (0.5 pass). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

43 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. …s on an old laptop 1.2GHz dual-core CPU (1 pass). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

44 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. …s on an old laptop 1.2GHz dual-core CPU (2 passes). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

45 Online Structured Matrix Factorization Dictionary Learning on Natural Image Patches Consider n = … whitened natural image patches of size m = …. We learn a dictionary with K = 256 elements. …s on an old laptop 1.2GHz dual-core CPU (5 passes). Julien Mairal Incremental and Stochastic MM Algorithms 28/27

46 References I Y. F. Atchadé, G. Fort, and E. Moulines. On stochastic proximal gradient algorithms. arXiv preprint, 2014. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 2009. D. Böhning and B. G. Lindsay. Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4), 1988. E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14, 2008. O. Cappé and E. Moulines. On-line expectation maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B, 71(3), 2009. Julien Mairal Incremental and Stochastic MM Algorithms 29/27

47 References II P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer. J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10, 2009. M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), December 2006. G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with non-convex penalties and DC programming. IEEE Transactions on Signal Processing, 57(12), 2009. T. Jebara and A. Choromanska. Majorization for CRFs and latent likelihoods. In Advances in Neural Information Processing Systems, 2012. Julien Mairal Incremental and Stochastic MM Algorithms 30/27

48 References III R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv preprint, 2009. E. Khan, B. Marlin, G. Bouchard, and K. Murphy. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing Systems, 2010. K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1-20, 2000. J. Mairal. Optimization with first-order surrogate functions. In Proceedings of the International Conference on Machine Learning (ICML), 2013a. J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013b. Julien Mairal Incremental and Stochastic MM Algorithms 31/27

49 References IV J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. arXiv preprint, 2014. J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 2010a. J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems, 2010b. K. Mohan and M. Fazel. Iterative reweighted algorithms for matrix rank minimization. Journal of Machine Learning Research, 13, 2012. R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89, 1998. Julien Mairal Incremental and Stochastic MM Algorithms 32/27

50 References V A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 2009. Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007. B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1997. M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013. S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint, 2012. S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing. Julien Mairal Incremental and Stochastic MM Algorithms 33/27

51 References VI L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11, 2010. Julien Mairal Incremental and Stochastic MM Algorithms 34/27
