Optimization methods for large-scale machine learning and sparse estimation. Julien Mairal


1 Optimization methods for large-scale machine learning and sparse estimation Julien Mairal Inria Grenoble Nantes, Mascot-Num, 2018 Part II Julien Mairal Optim for large-scale ML and sparse estimation 1/71

2 Common paradigm: optimization for machine learning Optimization is central to machine learning. For instance, in supervised learning, the goal is to learn a prediction function f : X -> Y given labeled training data (x_i, y_i)_{i=1,...,n} with x_i in X, and y_i in Y: min_{f in F} (1/n) sum_{i=1}^n L(y_i, f(x_i)) [empirical risk, data fit] + λ Ω(f) [regularization]. [Vapnik, 1995, Bottou, Curtis, and Nocedal, 2016]... Julien Mairal Optim for large-scale ML and sparse estimation 2/71

3 Paradigm 3: The sparsity principle Let us consider again the classical scientific paradigm: 1 observe the world (gather data); 2 propose models of the world (design and learn); 3 test on new data (estimate the generalization error). [Corfield et al., 2009]. Julien Mairal Optim for large-scale ML and sparse estimation 3/71

4 Paradigm 3: The sparsity principle Let us consider again the classical scientific paradigm: 1 observe the world (gather data); 2 propose models of the world (design and learn); 3 test on new data (estimate the generalization error). But... it is not always possible to distinguish the generalization error of various models based on available data. when a complex model A performs slightly better than a simple model B, should we prefer A or B? generalization error requires a predictive task: what about unsupervised learning? which measure should we use? we are also leaving aside the problem of non i.i.d. train/test data, biased data, testing with counterfactual reasoning... [Corfield et al., 2009, Bottou et al., 2013, Schölkopf et al., 2012]. Julien Mairal Optim for large-scale ML and sparse estimation 3/71

5 Paradigm 3: The sparsity principle (a) Dorothy Wrinch (b) Harold Jeffreys The existence of simple laws is, then, apparently, to be regarded as a quality of nature; and accordingly we may infer that it is justifiable to prefer a simple law to a more complex one that fits our observations slightly better. [Wrinch and Jeffreys, 1921]. Julien Mairal Optim for large-scale ML and sparse estimation 4/71

6 Paradigm 3: The sparsity principle Remarks: sparsity is... appealing for experimental sciences for model interpretation; (too-)well understood in some mathematical contexts: min_{w in R^p} (1/n) sum_{i=1}^n L(y_i, w^T x_i) [empirical risk, data fit] + λ ||w||_1 [regularization]; extremely powerful for unsupervised learning in the context of matrix factorization, and simple to use. [Olshausen and Field, 1996, Chen, Donoho, and Saunders, 1999, Tibshirani, 1996]... Julien Mairal Optim for large-scale ML and sparse estimation 5/71

7 Paradigm 3: The sparsity principle Remarks: sparsity is... appealing for experimental sciences for model interpretation; (too-)well understood in some mathematical contexts: min_{w in R^p} (1/n) sum_{i=1}^n L(y_i, w^T x_i) [empirical risk, data fit] + λ ||w||_1 [regularization]; extremely powerful for unsupervised learning in the context of matrix factorization, and simple to use. Today's challenges Develop sparse and stable (and invariant?) models. Go beyond clustering / low-rank / union of subspaces. [Olshausen and Field, 1996, Chen, Donoho, and Saunders, 1999, Tibshirani, 1996]... Julien Mairal Optim for large-scale ML and sparse estimation 5/71

8 Some references On kernel methods B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond J. Shawe-Taylor and N. Cristianini. An introduction to support vector machines and other kernel-based learning methods slides: On sparse estimation M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing J. Mairal, F. Bach, and J. Ponce. Sparse Modeling for Image and Vision Processing free online. Julien Mairal Optim for large-scale ML and sparse estimation 6/71

9 Some references On large-scale optimization L. Bottou, F. E. Curtis and J. Nocedal. Optimization methods for large-scale machine learning, preprint arxiv: , Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning slides by F. Bach: Julien Mairal Optim for large-scale ML and sparse estimation 7/71

10 Material on sparse estimation (freely available on arxiv) J. Mairal, F. Bach and J. Ponce. Sparse Modeling for Image and Vision Processing. Foundations and Trends in Computer Graphics and Vision F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1) Julien Mairal Optim for large-scale ML and sparse estimation 8/71

11 Part I: Large-scale optimization for machine learning Julien Mairal Optim for large-scale ML and sparse estimation 9/71

12 Focus of this part Minimizing large finite sums Consider the minimization of a large sum of convex functions min_{x in R^d} { f(x) = (1/n) sum_{i=1}^n f_i(x) + ψ(x) }, where each f_i is L-smooth and convex and ψ is a convex regularization penalty but not necessarily differentiable. Julien Mairal Optim for large-scale ML and sparse estimation 10/71

13 Focus of this part Minimizing large finite sums Consider the minimization of a large sum of convex functions min_{x in R^d} { f(x) = (1/n) sum_{i=1}^n f_i(x) + ψ(x) }, where each f_i is L-smooth and convex and ψ is a convex regularization penalty but not necessarily differentiable. Why this setting? convexity makes it easy to obtain complexity bounds. convex optimization is often effective for non-convex problems. What we will not cover performance of approaches in terms of test error. Julien Mairal Optim for large-scale ML and sparse estimation 10/71

14 Introduction of a few optimization principles Convex Functions Why do we care about convexity? (Figure: a convex function f(x).) Julien Mairal Optim for large-scale ML and sparse estimation 11/71

15 Introduction of a few optimization principles Convex Functions Local observations give information about the global optimum. (Figure: a convex function f(x).) ∇f(x^*) = 0 is a necessary and sufficient optimality condition for differentiable convex functions; it is often easy to upper-bound f(x) - f^*. Julien Mairal Optim for large-scale ML and sparse estimation 11/71

16 Introduction of a few optimization principles An important inequality for smooth convex functions If f is convex and smooth: f(x) >= f(x_0) + ∇f(x_0)^T (x - x_0) [linear approximation]; if f is non-smooth, a similar inequality holds for subgradients. Julien Mairal Optim for large-scale ML and sparse estimation 12/71

17 Introduction of a few optimization principles An important inequality for smooth functions If ∇f is L-Lipschitz continuous (f does not need to be convex): f(x) <= g(x) = f(x_0) + ∇f(x_0)^T (x - x_0) [linear approximation] + (L/2) ||x - x_0||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 13/71

18 Introduction of a few optimization principles An important inequality for smooth functions If ∇f is L-Lipschitz continuous (f does not need to be convex): f(x) <= g(x) = f(x_0) + ∇f(x_0)^T (x - x_0) [linear approximation] + (L/2) ||x - x_0||_2^2; the surrogate can be rewritten g(x) = C_{x_0} + (L/2) ||x_0 - (1/L) ∇f(x_0) - x||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 13/71

19 Introduction of a few optimization principles An important inequality for smooth functions If ∇f is L-Lipschitz continuous (f does not need to be convex): f(x) <= g(x) = f(x_0) + ∇f(x_0)^T (x - x_0) [linear approximation] + (L/2) ||x - x_0||_2^2; the surrogate is minimized by x_1 = x_0 - (1/L) ∇f(x_0) (gradient descent step). Julien Mairal Optim for large-scale ML and sparse estimation 13/71

20 Introduction of a few optimization principles Gradient Descent Algorithm Assume that f is convex and L-smooth (∇f is L-Lipschitz). Theorem Consider the algorithm x_t <- x_{t-1} - (1/L) ∇f(x_{t-1}). Then, f(x_t) - f^* <= L ||x_0 - x^*||_2^2 / (2t). Julien Mairal Optim for large-scale ML and sparse estimation 14/71
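To make the iteration concrete, here is a minimal NumPy sketch of the step x_t <- x_{t-1} - (1/L) ∇f(x_{t-1}); the least-squares objective, variable names, and iteration count are illustrative choices and not taken from the slides.

```python
import numpy as np

def gradient_descent(grad_f, x0, L, n_iters=100):
    """Iterate x_t <- x_{t-1} - (1/L) * grad_f(x_{t-1})."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - grad_f(x) / L
    return x

# Toy example: f(x) = 0.5 * ||A x - b||^2, whose gradient A^T (A x - b)
# is L-Lipschitz with L = largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
L = np.linalg.eigvalsh(A.T @ A).max()
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(10), L)
```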

21 Proof (1/2) Proof of the main inequality for smooth functions We want to show that for all x and z, f(x) <= f(z) + ∇f(z)^T (x - z) + (L/2) ||x - z||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 15/71

22 Proof (1/2) Proof of the main inequality for smooth functions We want to show that for all x and z, f(x) <= f(z) + ∇f(z)^T (x - z) + (L/2) ||x - z||_2^2. By using Taylor's theorem with integral form, f(x) - f(z) = ∫_0^1 ∇f(tx + (1-t)z)^T (x - z) dt. Then, f(x) - f(z) - ∇f(z)^T (x - z) = ∫_0^1 (∇f(tx + (1-t)z) - ∇f(z))^T (x - z) dt <= ∫_0^1 |(∇f(tx + (1-t)z) - ∇f(z))^T (x - z)| dt <= ∫_0^1 ||∇f(tx + (1-t)z) - ∇f(z)||_2 ||x - z||_2 dt (C.-S.) <= ∫_0^1 L t ||x - z||_2^2 dt = (L/2) ||x - z||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 15/71

23 Proof (2/2) Proof of the theorem We have shown that for all x, f(x) <= g_t(x) = f(x_{t-1}) + ∇f(x_{t-1})^T (x - x_{t-1}) + (L/2) ||x - x_{t-1}||_2^2. g_t is minimized by x_t; it can be rewritten g_t(x) = g_t(x_t) + (L/2) ||x - x_t||_2^2. Then, f(x_t) <= g_t(x_t) = g_t(x^*) - (L/2) ||x^* - x_t||_2^2 = f(x_{t-1}) + ∇f(x_{t-1})^T (x^* - x_{t-1}) + (L/2) ||x^* - x_{t-1}||_2^2 - (L/2) ||x^* - x_t||_2^2 <= f^* + (L/2) ||x^* - x_{t-1}||_2^2 - (L/2) ||x^* - x_t||_2^2. By summing from t = 1 to T, we have a telescopic sum: T (f(x_T) - f^*) <= sum_{t=1}^T (f(x_t) - f^*) <= (L/2) ||x^* - x_0||_2^2 - (L/2) ||x^* - x_T||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 16/71

24 Introduction of a few optimization principles An important inequality for smooth and µ-strongly convex functions If ∇f is L-Lipschitz continuous and f is µ-strongly convex: f(x) <= f(x_0) + ∇f(x_0)^T (x - x_0) + (L/2) ||x - x_0||_2^2; f(x) >= f(x_0) + ∇f(x_0)^T (x - x_0) + (µ/2) ||x - x_0||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 17/71

25 Introduction of a few optimization principles Proposition When f is µ-strongly convex and L-smooth, the gradient descent algorithm with step-size 1/L produces iterates such that f(x_t) - f^* <= (1 - µ/L)^t (L/2) ||x_0 - x^*||_2^2. We call that a linear convergence rate. Remarks if f is twice differentiable, L and µ represent the largest and smallest eigenvalues of the Hessian, respectively. L/µ is called the condition number. Julien Mairal Optim for large-scale ML and sparse estimation 18/71

26 Proof We start from an inequality from the previous proof: f(x_t) <= f(x_{t-1}) + ∇f(x_{t-1})^T (x^* - x_{t-1}) + (L/2) ||x^* - x_{t-1}||_2^2 - (L/2) ||x^* - x_t||_2^2 <= f^* + ((L - µ)/2) ||x^* - x_{t-1}||_2^2 - (L/2) ||x^* - x_t||_2^2. In addition, we have that f(x_t) >= f^* + (µ/2) ||x_t - x^*||_2^2, and thus ||x^* - x_t||_2^2 <= ((L - µ)/(L + µ)) ||x^* - x_{t-1}||_2^2 <= (1 - µ/L) ||x^* - x_{t-1}||_2^2. Finally, f(x_t) - f^* <= (L/2) ||x_t - x^*||_2^2 <= (1 - µ/L)^t (L/2) ||x^* - x_0||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 19/71

27 Introduction of a few optimization principles Remark: with stepsize 1/L, gradient descent may be interpreted as a majorization-minimization algorithm. Figure: the surrogate g_y majorizes f and is tight at y = x_old; at each step, we update x_new <- argmin_{x in R^p} g_y(x). Julien Mairal Optim for large-scale ML and sparse estimation 20/71

28 The proximal gradient method An important inequality for composite functions If ∇f_0 is L-Lipschitz continuous, then for f(x) = f_0(x) + ψ(x): f_0(x) + ψ(x) <= g(x) = f_0(x_0) + ∇f_0(x_0)^T (x - x_0) + (L/2) ||x - x_0||_2^2 + ψ(x); x_1 minimizes g. Julien Mairal Optim for large-scale ML and sparse estimation 21/71

29 The proximal gradient method Gradient descent for minimizing f consists of x_t <- argmin_{x in R^p} g_t(x), i.e., x_t <- x_{t-1} - (1/L) ∇f(x_{t-1}). The proximal gradient method for minimizing f = f_0 + ψ consists of x_t <- argmin_{x in R^p} g_t(x), which is equivalent to x_t <- argmin_{x in R^p} (1/2) ||x_{t-1} - (1/L) ∇f_0(x_{t-1}) - x||_2^2 + (1/L) ψ(x). It requires computing efficiently the proximal operator of ψ: y -> argmin_{x in R^p} (1/2) ||y - x||_2^2 + ψ(x). Julien Mairal Optim for large-scale ML and sparse estimation 22/71

30 The proximal gradient method Remarks also known as forward-backward algorithm; has similar convergence rates as the gradient descent method (the proof is nearly identical). there exist line search schemes to automatically tune L; The case of l_1 The proximal operator of λ||.||_1 is the soft-thresholding operator x[j] = sign(y[j]) (|y[j]| - λ)_+. The resulting algorithm is called iterative soft-thresholding. [Nowak and Figueiredo, 2001, Daubechies et al., 2004, Combettes and Wajs, 2006, Beck and Teboulle, 2009, Wright et al., 2009, Nesterov, 2013]... Julien Mairal Optim for large-scale ML and sparse estimation 23/71
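A minimal sketch of the resulting iterative soft-thresholding (proximal gradient) algorithm for the l_1-regularized least-squares problem (1/2)||A x - b||_2^2 + λ||x||_1; the function names and iteration budget are illustrative, not taken from the slides.

```python
import numpy as np

def soft_threshold(y, lam):
    """Proximal operator of lam * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def ista(A, b, lam, n_iters=300):
    """Proximal gradient method for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.eigvalsh(A.T @ A).max()       # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                # gradient of the smooth part
        x = soft_threshold(x - grad / L, lam / L)   # prox of (lam/L) * ||.||_1
    return x
```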

31 The proximal gradient method The proximal operator for the group Lasso penalty min_{x in R^p} (1/2) ||y - x||_2^2 + λ sum_{g in G} ||x[g]||_q. For q = 2, x[g] = (y[g] / ||y[g]||_2) (||y[g]||_2 - λ)_+ for all g in G. For q = ∞, x[g] = y[g] - Π_{||.||_1 <= λ}[y[g]] for all g in G. These formulas generalize soft-thresholding to groups of variables. Julien Mairal Optim for large-scale ML and sparse estimation 24/71
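For q = 2 the formula above is a block soft-thresholding; a small sketch, where the encoding of the (non-overlapping) groups as a list of index arrays is my own convention:

```python
import numpy as np

def prox_group_lasso(y, groups, lam):
    """Prox of lam * sum_g ||x[g]||_2 for non-overlapping groups:
    each block y[g] is scaled by (1 - lam / ||y[g]||_2)_+."""
    x = np.zeros_like(y)
    for g in groups:                          # each g is an array of indices
        norm_g = np.linalg.norm(y[g])
        if norm_g > lam:
            x[g] = (1.0 - lam / norm_g) * y[g]
    return x

# Example with two groups of a vector in R^5.
y = np.array([0.3, -0.1, 2.0, -1.5, 0.4])
print(prox_group_lasso(y, [np.array([0, 1]), np.array([2, 3, 4])], lam=0.5))
```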

32 The proximal gradient method A few proximal operators: l 0 -penalty: hard-thresholding; l 1 -norm: soft-thresholding; group-lasso: group soft-thresholding; fused-lasso (1D total variation): [Hoefling, 2010]; total variation: [Chambolle and Darbon, 2009]; hierarchical norms: [Jenatton et al., 2011], O(p) complexity; overlapping group Lasso with l -norm: [Mairal et al., 2010]; Julien Mairal Optim for large-scale ML and sparse estimation 25/71

33 Accelerated gradient descent methods Nesterov introduced in the 80's an acceleration scheme for the gradient descent algorithm. It was generalized later to the composite setting. FISTA x_t <- argmin_{x in R^p} (1/2) ||y_{t-1} - (1/L) ∇f_0(y_{t-1}) - x||_2^2 + (1/L) ψ(x); Find α_t > 0 s.t. α_t^2 = (1 - α_t) α_{t-1}^2 + (µ/L) α_t; y_t <- x_t + β_t (x_t - x_{t-1}) with β_t = α_{t-1}(1 - α_{t-1}) / (α_{t-1}^2 + α_t). f(x_t) - f^* = O(1/t^2) for convex problems; f(x_t) - f^* = O((1 - sqrt(µ/L))^t) for µ-strongly convex problems; Acceleration works in many practical cases. see [Beck and Teboulle, 2009, Nesterov, 1983, 2004, 2013] Julien Mairal Optim for large-scale ML and sparse estimation 26/71
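A minimal sketch of the accelerated proximal gradient method, written here with the classical momentum sequence of Beck and Teboulle for the convex case (µ = 0), which differs in form from the α_t / β_t recursion above but plays the same role; names and the Lasso example are illustrative.

```python
import numpy as np

def fista(grad_f0, prox_psi, x0, L, n_iters=300):
    """Accelerated proximal gradient (convex case): proximal step at an
    extrapolated point y, then momentum update of y."""
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iters):
        x = prox_psi(y - grad_f0(y) / L, 1.0 / L)        # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)      # extrapolation step
        x_prev, t = x, t_next
    return x_prev

# Lasso example: psi = lam * ||.||_1, whose prox is soft-thresholding.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((50, 20)), rng.standard_normal(50), 0.1
L = np.linalg.eigvalsh(A.T @ A).max()
prox_l1 = lambda z, step: np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
x_hat = fista(lambda x: A.T @ (A @ x - b), prox_l1, np.zeros(20), L)
```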

34 What do we mean by acceleration? Complexity analysis for large finite sums Since f is a sum of n functions, computing ∇f requires computing n gradients ∇f_i. The complexity to reach an ε-solution is given below. ISTA: O(n (L/µ) log(1/ε)) for µ > 0, O(n L/ε) for µ = 0. FISTA: O(n sqrt(L/µ) log(1/ε)) for µ > 0, O(n sqrt(L/ε)) for µ = 0. Remarks ε-solution means here f(x_t) - f^* <= ε. For n = 1, the rates of FISTA are optimal for a first-order local black box [Nesterov, 2004]. For L = 1 and µ = 1/n, the complexity scales at best in n^{3/2}. Julien Mairal Optim for large-scale ML and sparse estimation 27/71

35 How does acceleration work? Unfortunately, the literature does not provide any simple geometric explanation... Julien Mairal Optim for large-scale ML and sparse estimation 28/71

36 How does acceleration work? Unfortunately, the literature does not provide any simple geometric explanation... but there are a few obvious facts and a mechanism introduced by Nesterov, called estimate sequence. Obvious fact Simple gradient descent steps are blind to the past iterates, and are based on a purely local model of the objective. Accelerated methods usually involve an extrapolation step y_t = x_t + β_t (x_t - x_{t-1}) with β_t in (0,1). Nesterov interprets acceleration as relying on a better model of the objective called estimate sequence. Julien Mairal Optim for large-scale ML and sparse estimation 28/71

37 How does acceleration work? Definition of estimate sequence [Nesterov]. A pair of sequences (ϕ_t)_{t>=0} and (λ_t)_{t>=0}, with λ_t >= 0 and ϕ_t : R^p -> R, is called an estimate sequence of function f if λ_t -> 0 and for any x in R^p and all t >= 0, ϕ_t(x) - f(x) <= λ_t (ϕ_0(x) - f(x)). In addition, if for some sequence (x_t)_{t>=0} we have f(x_t) <= ϕ_t^* = min_{x in R^p} ϕ_t(x), then f(x_t) - f^* <= λ_t (ϕ_0(x^*) - f^*), where x^* is a minimizer of f. Julien Mairal Optim for large-scale ML and sparse estimation 29/71

38 How does acceleration work? In summary, we need two properties 1 ϕ_t(x) <= (1 - λ_t) f(x) + λ_t ϕ_0(x); 2 f(x_t) <= ϕ_t^* = min_{x in R^p} ϕ_t(x). Remarks ϕ_t is neither an upper-bound, nor a lower-bound; Finding the right estimate sequence is often nontrivial. Julien Mairal Optim for large-scale ML and sparse estimation 30/71

39 How does acceleration work? In summary, we need two properties 1 ϕ_t(x) <= (1 - λ_t) f(x) + λ_t ϕ_0(x); 2 f(x_t) <= ϕ_t^* = min_{x in R^p} ϕ_t(x). How to build an estimate sequence? Define ϕ_t recursively: ϕ_t(x) = (1 - α_t) ϕ_{t-1}(x) + α_t d_t(x), where d_t is a lower-bound, e.g., if F is smooth, d_t(x) = F(y_t) + ∇F(y_t)^T (x - y_t) + (µ/2) ||x - y_t||_2^2. Then, work hard to choose α_t as large as possible, and y_t and x_t such that property 2 holds. Subsequently, λ_t = prod_{k=1}^t (1 - α_k). Julien Mairal Optim for large-scale ML and sparse estimation 30/71

40 The stochastic (sub)gradient descent algorithm Consider now the minimization of an expectation min_{x in R^p} f(x) = E_z[l(x,z)]. To simplify, we assume that for all z, x -> l(x,z) is differentiable. Algorithm At iteration t, Randomly draw one example z_t from the training set; Update the current iterate x_t <- x_{t-1} - η_t ∇f_t(x_{t-1}) with f_t(x) = l(x, z_t). Perform online averaging of the iterates (optional) x̄_t <- (1 - γ_t) x̄_{t-1} + γ_t x_t. Julien Mairal Optim for large-scale ML and sparse estimation 31/71

41 The stochastic (sub)gradient descent algorithm There are various learning-rate strategies (constant, varying step-sizes), and averaging strategies. Depending on the problem assumptions and choice of η_t, γ_t, classical convergence rates may be obtained: f(x̄_t) - f^* = O(1/sqrt(t)) for convex problems; f(x̄_t) - f^* = O(1/t) for strongly-convex ones; Remarks The convergence rates are not great, but the complexity per-iteration is small (1 gradient evaluation for minimizing an empirical risk versus n for the batch algorithm). When the amount of data is infinite, the method minimizes the expected risk (which is what we want). Choosing a good learning rate automatically is an open problem. Julien Mairal Optim for large-scale ML and sparse estimation 32/71
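A minimal sketch of the stochastic gradient algorithm of the previous slides, with uniform iterate averaging (γ_t = 1/t); the interface and the suggested step-size are illustrative choices.

```python
import numpy as np

def sgd(grad_fi, n, x0, step, n_iters=10000):
    """x_t <- x_{t-1} - eta_t * (sub)gradient of a randomly drawn example,
    with Polyak-Ruppert averaging of the iterates."""
    rng = np.random.default_rng(0)
    x, x_avg = x0.copy(), x0.copy()
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                  # draw one example at random
        x = x - step(t) * grad_fi(x, i)      # stochastic (sub)gradient step
        x_avg += (x - x_avg) / t             # online average, gamma_t = 1/t
    return x_avg

# Usage sketch: grad_fi(x, i) returns a (sub)gradient of the i-th loss at x,
# e.g. with a decreasing step-size such as step = lambda t: 1.0 / np.sqrt(t).
```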

42 Proof of an O(1/sqrt(t)) rate for the convex case Inspired by (aka, stolen from) F. Bach's slides Assumptions The solution lies in a bounded domain C = {||x|| <= D}. The sub-gradients are bounded on C: ||∇f_t(x)|| <= B. Fix in advance the number of iterations T and choose η_t = 2D / (B sqrt(T)). Choose Polyak-Ruppert averaging x̄_T = (1/T) sum_{t=0}^{T-1} x_t. Perform updates with projections x_t <- Π_C[x_{t-1} - η_t ∇f_t(x_{t-1})]. Proposition E[f(x̄_T) - f^*] <= 2DB / sqrt(T). Julien Mairal Optim for large-scale ML and sparse estimation 33/71

43 Proof of an O(1/sqrt(t)) rate for the convex case Inspired by (aka, stolen from) F. Bach's slides F_t: information up to time t. ||x|| <= D and ||∇f_t(x)|| <= B. Besides, E[∇f_t(x) | F_{t-1}] = ∇f(x). ||x_t - x^*||^2 <= ||x_{t-1} - η_t ∇f_t(x_{t-1}) - x^*||^2 <= ||x_{t-1} - x^*||^2 + B^2 η_t^2 - 2η_t (x_{t-1} - x^*)^T ∇f_t(x_{t-1}). Take now conditional expectations: E[||x_t - x^*||^2 | F_{t-1}] <= ||x_{t-1} - x^*||^2 + B^2 η_t^2 - 2η_t (x_{t-1} - x^*)^T ∇f(x_{t-1}) <= ||x_{t-1} - x^*||^2 + B^2 η_t^2 - 2η_t (f(x_{t-1}) - f^*). Take now full expectations: E[||x_t - x^*||^2] <= E[||x_{t-1} - x^*||^2] + B^2 η_t^2 - 2η_t E[f(x_{t-1}) - f^*], and, after reorganizing the terms, E[f(x_{t-1}) - f^*] <= B^2 η_t / 2 + (1/(2η_t)) (E[||x_{t-1} - x^*||^2] - E[||x_t - x^*||^2]). Julien Mairal Optim for large-scale ML and sparse estimation 34/71

44 Proof of an O(1/sqrt(t)) rate for the convex case Inspired by (aka, stolen from) F. Bach's slides We start again from E[f(x_{t-1}) - f^*] <= B^2 η_t / 2 + (1/(2η_t)) (E[||x_{t-1} - x^*||^2] - E[||x_t - x^*||^2]), and we exploit the telescopic sum: sum_{t=1}^T E[f(x_{t-1}) - f^*] <= sum_{t=1}^T B^2 η_t / 2 + sum_{t=1}^T (1/(2η_t)) (E[||x_{t-1} - x^*||^2] - E[||x_t - x^*||^2]) <= T B^2 η / 2 + 4D^2 / (2η) <= 2DB sqrt(T) with η = 2D / (B sqrt(T)). Finally, we conclude by using a convexity inequality: E[f((1/T) sum_{t=0}^{T-1} x_t)] - f^* <= 2DB / sqrt(T). Julien Mairal Optim for large-scale ML and sparse estimation 35/71

45 Back to finite sums Consider now the case of interest for us today: min_{x in R^p} (1/n) sum_{i=1}^n f_i(x). Question Can we do as well as SGD in terms of cost per iteration, while enjoying a fast (linear) convergence rate like (accelerated or not) gradient descent? For n = 1, no! The rates are optimal for a first-order local black box [Nesterov, 2004]. For n >> 1, yes! We need to design algorithms whose per-iteration computational complexity is smaller than n; whose convergence rate may be worse than FISTA... ...but with a better expected computational complexity. Julien Mairal Optim for large-scale ML and sparse estimation 36/71

46 Incremental gradient descent methods min_{x in R^p} { f(x) = (1/n) sum_{i=1}^n f_i(x) }. Several randomized algorithms are designed with one ∇f_i computed per iteration, with fast convergence rates, e.g., SAG [Schmidt et al., 2013]: x_k <- x_{k-1} - (γ/(Ln)) sum_{i=1}^n y_i^k with y_i^k = ∇f_i(x_{k-1}) if i = i_k, y_i^{k-1} otherwise. Julien Mairal Optim for large-scale ML and sparse estimation 37/71

47 Incremental gradient descent methods min_{x in R^p} { f(x) = (1/n) sum_{i=1}^n f_i(x) }. Several randomized algorithms are designed with one ∇f_i computed per iteration, with fast convergence rates, e.g., SAG [Schmidt et al., 2013]: x_k <- x_{k-1} - (γ/(Ln)) sum_{i=1}^n y_i^k with y_i^k = ∇f_i(x_{k-1}) if i = i_k, y_i^{k-1} otherwise. See also SVRG, SAGA, SDCA, MISO, Finito... Some of these algorithms perform updates of the form x_k <- x_{k-1} - η_k g_k with E[g_k] = ∇f(x_{k-1}), but g_k has lower variance than in SGD. [Schmidt et al., 2013, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015] Julien Mairal Optim for large-scale ML and sparse estimation 37/71
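A minimal sketch of the SAG update above: a table keeps the last gradient seen for each f_i, and each iteration refreshes one entry and moves along the scaled sum of the table; the constant γ is left as a user parameter since its theoretical value is not discussed here, and the interface is illustrative.

```python
import numpy as np

def sag(grad_fi, n, x0, L, gamma=1.0, n_iters=10000):
    """SAG sketch: x_k <- x_{k-1} - gamma/(L*n) * sum_i y_i, where y_i stores
    the last gradient of f_i that was computed."""
    rng = np.random.default_rng(0)
    x = x0.copy()
    y = np.zeros((n, x0.shape[0]))     # gradient table, one row per example
    y_sum = np.zeros_like(x0)          # running sum of the table
    for _ in range(n_iters):
        i = rng.integers(n)
        g = grad_fi(x, i)
        y_sum += g - y[i]              # update the sum in O(d)
        y[i] = g
        x = x - (gamma / (L * n)) * y_sum
    return x
```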

48 Incremental gradient descent methods These methods achieve low (worst-case) complexity in expectation. The number of gradient evaluations to ensure f(x_k) - f^* <= ε is, for µ > 0: FISTA: O(n sqrt(L/µ) log(1/ε)); SVRG, SAG, SAGA, SDCA, MISO, Finito: O(max(n, L/µ) log(1/ε)). Julien Mairal Optim for large-scale ML and sparse estimation 38/71

49 Incremental gradient descent methods These methods achieve low (worst-case) complexity in expectation. The number of gradient evaluations to ensure f(x_k) - f^* <= ε is, for µ > 0: FISTA: O(n sqrt(L/µ) log(1/ε)); SVRG, SAG, SAGA, SDCA, MISO, Finito: O(max(n, L/µ) log(1/ε)). Main features vs. stochastic gradient descent Same complexity per-iteration (but higher memory footprint). Faster convergence (exploit the finite-sum structure). Less parameter tuning than SGD. Some variants are compatible with a composite term ψ. Julien Mairal Optim for large-scale ML and sparse estimation 38/71

50 Incremental gradient descent methods These methods achieve low (worst-case) complexity in expectation. The number of gradient evaluations to ensure f(x_k) - f^* <= ε is, for µ > 0: FISTA: O(n sqrt(L/µ) log(1/ε)); SVRG, SAG, SAGA, SDCA, MISO, Finito: O(max(n, L/µ) log(1/ε)). Important remarks When f_i(x) = l(z_i^T x), the memory footprint is O(n), otherwise O(dn), except for SVRG (O(d)). Some algorithms require an estimate of µ; L is the average (or max) of the Lipschitz constants of the ∇f_i's. The L for FISTA is the Lipschitz constant of ∇f, which is at most L. Julien Mairal Optim for large-scale ML and sparse estimation 38/71

51 Incremental gradient descent methods stealing again a bit from F. Bach's slides. Variance reduction Consider two random variables X, Y and define Z = X - Y + E[Y]. Then, E[Z] = E[X] and Var(Z) = Var(X) + Var(Y) - 2 cov(X,Y). The variance of Z may be smaller if X and Y are positively correlated. Julien Mairal Optim for large-scale ML and sparse estimation 39/71

52 Incremental gradient descent methods stealing again a bit from F. Bach's slides. Variance reduction Consider two random variables X, Y and define Z = X - Y + E[Y]. Then, E[Z] = E[X] and Var(Z) = Var(X) + Var(Y) - 2 cov(X,Y). The variance of Z may be smaller if X and Y are positively correlated. Why is it useful for stochastic optimization? step-sizes for SGD have to decrease to ensure convergence. with variance reduction, one may use constant step-sizes. Julien Mairal Optim for large-scale ML and sparse estimation 39/71

53 Incremental gradient descent methods SVRG x_t = x_{t-1} - γ (∇f_{i_t}(x_{t-1}) - ∇f_{i_t}(y) + ∇f(y)), where y is updated every epoch and E[∇f_{i_t}(y) | F_{t-1}] = ∇f(y). SAGA x_t = x_{t-1} - γ (∇f_{i_t}(x_{t-1}) - y_{i_t}^{t-1} + (1/n) sum_{i=1}^n y_i^{t-1}), where E[y_{i_t}^{t-1} | F_{t-1}] = (1/n) sum_{i=1}^n y_i^{t-1} and y_i^t = ∇f_i(x_{t-1}) if i = i_t, y_i^{t-1} otherwise. MISO/Finito: for n >= L/µ, same form as SAGA but (1/n) sum_{i=1}^n y_i^{t-1} = µ x_{t-1} and y_i^t = ∇f_i(x_{t-1}) - µ x_{t-1} if i = i_t, y_i^{t-1} otherwise. Julien Mairal Optim for large-scale ML and sparse estimation 40/71
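A minimal sketch of the SVRG update above, with the anchor point y refreshed at the start of each epoch; the epoch length and interface are illustrative choices.

```python
import numpy as np

def svrg(grad_fi, grad_f, n, x0, gamma, n_epochs=30):
    """SVRG: x_t <- x_{t-1} - gamma * (grad_fi(x) - grad_fi(y) + grad_f(y))."""
    rng = np.random.default_rng(0)
    x = x0.copy()
    for _ in range(n_epochs):
        y = x.copy()                   # anchor point, updated every epoch
        full_grad = grad_f(y)          # one full pass over the data
        for _ in range(n):             # inner loop of one epoch
            i = rng.integers(n)
            x = x - gamma * (grad_fi(x, i) - grad_fi(y, i) + full_grad)
    return x
```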

54 Can we do even better for large finite sums? Without vs with acceleration, for µ > 0: FISTA: O(n sqrt(L/µ) log(1/ε)); SVRG, SAG, SAGA, SDCA, MISO, Finito: O(max(n, L/µ) log(1/ε)); Accelerated versions: Õ(max(n, sqrt(n L/µ)) log(1/ε)). Acceleration for specific algorithms [Shalev-Shwartz and Zhang, 2014, Lan, 2015, Allen-Zhu, 2016]. Generic acceleration: Catalyst [Lin et al., 2015]. see [Agarwal and Bottou, 2015] for discussions about optimality. Julien Mairal Optim for large-scale ML and sparse estimation 41/71

55 What we have not (or should have) covered Important approaches and concepts distributed optimization. proximal splitting / ADMM. Quasi-Newton approaches. Julien Mairal Optim for large-scale ML and sparse estimation 42/71

56 What we have not (or should have) covered Important approaches and concepts distributed optimization. proximal splitting / ADMM. Quasi-Newton approaches. The question Should we care that much about minimizing finite sums when all we want is minimizing an expectation? Julien Mairal Optim for large-scale ML and sparse estimation 42/71

57 Part II: Sparse estimation Julien Mairal Optim for large-scale ML and sparse estimation 43/71

58 Chronological overview of parsimony 14th century: Ockham's razor; 1921: Wrinch and Jeffreys' simplicity principle; 1952: Markowitz's portfolio selection; 60's and 70's: best subset selection in statistics; 70's: use of the l_1-norm for signal recovery in geophysics; 90's: wavelet thresholding in signal processing; 1996: Olshausen and Field's dictionary learning; 1996 and 1999: Lasso (statistics) and basis pursuit (signal processing); 2006: compressed sensing (signal processing) and Lasso consistency (statistics); 2006-now: applications of dictionary learning in various scientific fields such as image processing and computer vision. Julien Mairal Optim for large-scale ML and sparse estimation 44/71

59 Sparsity in the statistics literature from the 60's and 70's How to choose k? Mallows's C_p statistic [Mallows, 1964, 1966]; Akaike information criterion (AIC) [Akaike, 1973]; Bayesian information criterion (BIC) [Schwarz, 1978]; Minimum description length (MDL) [Rissanen, 1978]. These approaches lead to penalized problems min_{θ in R^p} L(θ) + λ ||θ||_0, with different choices of λ depending on the chosen criterion. Julien Mairal Optim for large-scale ML and sparse estimation 45/71

60 Sparsity in the statistics literature from the 60's and 70's How to solve the best k-subset selection problem? Unfortunately... the problem is NP-hard [Natarajan, 1995]. Two strategies combinatorial exploration with branch-and-bound techniques [Furnival and Wilson, 1974] leaps and bounds, exact algorithm but exponential complexity; greedy approach: forward selection [Efroymson, 1960] (originally developed for observing intermediate solutions), already contains all the ideas of matching pursuit algorithms. Important reference: [Hocking, 1976]. The analysis and selection of variables in linear regression. Biometrics. Julien Mairal Optim for large-scale ML and sparse estimation 46/71

61 Wavelet thresholding in signal processing from the 90's Wavelets were the topic of a long quest for representing natural images: 2D-Gabors [Daugman, 1985]; steerable wavelets [Simoncelli et al., 1992]; curvelets [Candès and Donoho, 2002]; contourlets [Do and Vertterli, 2003]; bandlets [Le Pennec and Mallat, 2005]; -lets (joke). (a) 2D Gabor filter. (b) With shifted phase. (c) With rotation. Julien Mairal Optim for large-scale ML and sparse estimation 47/71

62 Wavelet thresholding in signal processing from the 90's The theory of wavelets is well developed for continuous signals, e.g., in L^2(R), but also for discrete signals x in R^n. Julien Mairal Optim for large-scale ML and sparse estimation 48/71

63 Wavelet thresholding in signal processing from the 90's Given an orthogonal wavelet basis D = [d_1, ..., d_n] in R^{n x n}, the wavelet decomposition of x in R^n is simply β = D^T x and we have x = Dβ. The k-sparse approximation problem min_{α in R^p} (1/2) ||x - Dα||_2^2 s.t. ||α||_0 <= k, is not NP-hard here: since D is orthogonal, it is equivalent to min_{α in R^p} (1/2) ||β - α||_2^2 s.t. ||α||_0 <= k. Julien Mairal Optim for large-scale ML and sparse estimation 49/71

64 Wavelet thresholding in signal processing from the 90's Given an orthogonal wavelet basis D = [d_1, ..., d_n] in R^{n x n}, the wavelet decomposition of x in R^n is simply β = D^T x and we have x = Dβ. The k-sparse approximation problem min_{α in R^p} (1/2) ||x - Dα||_2^2 s.t. ||α||_0 <= k. The solution is obtained by hard-thresholding: α^{ht}[j] = δ_{|β[j]| >= µ} β[j], i.e., β[j] if |β[j]| >= µ and 0 otherwise, where µ is the k-th largest value among the set {|β[1]|, ..., |β[p]|}. Julien Mairal Optim for large-scale ML and sparse estimation 49/71
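A small sketch of this k-sparse approximation in an orthogonal basis: compute β = D^T x and keep its k largest entries in magnitude (the names and the trivial identity dictionary are illustrative).

```python
import numpy as np

def k_sparse_approx(x, D, k):
    """Best k-sparse approximation of x in an orthogonal basis D:
    hard-threshold the coefficients beta = D^T x at their k-th largest magnitude."""
    beta = D.T @ x
    alpha = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-k:]   # indices of the k largest |beta[j]|
    alpha[keep] = beta[keep]
    return alpha, D @ alpha                # sparse code and its reconstruction

# Trivial orthogonal dictionary (identity) for illustration.
alpha, x_approx = k_sparse_approx(np.array([0.1, -3.0, 0.5, 2.0]), np.eye(4), k=2)
```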

65 Wavelet thresholding in signal processing, 90's Another key operator is the soft-thresholding operator [see Donoho and Johnstone, 1994]: α^{st}[j] = sign(β[j]) max(|β[j]| - λ, 0), i.e., β[j] - λ if β[j] >= λ, β[j] + λ if β[j] <= -λ, and 0 otherwise, where λ is a parameter playing the same role as µ previously. With β = D^T x and D orthogonal, it provides the solution of the following sparse reconstruction problem: min_{α in R^p} (1/2) ||x - Dα||_2^2 + λ ||α||_1, which will be of high importance later. Julien Mairal Optim for large-scale ML and sparse estimation 50/71

66 Wavelet thresholding in signal processing, 90's (d) Soft-thresholding operator, α^{st} = sign(β) max(|β| - λ, 0). (e) Hard-thresholding operator, α^{ht} = δ_{|β| >= µ} β. Figure: Soft- and hard-thresholding operators, which are commonly used for signal estimation with orthogonal wavelet basis. Julien Mairal Optim for large-scale ML and sparse estimation 51/71

67 The modern parsimony and the l_1-norm Sparse linear models in signal processing Let x in R^n be a signal. Let D = [d_1, ..., d_p] in R^{n x p} be a set of elementary signals. We call it dictionary. D is adapted to x if it can represent it with a few elements, that is, there exists a sparse vector α in R^p such that x ≈ Dα. We call α the sparse code. (Figure: x ≈ Dα, with x in R^n, D in R^{n x p}, and α = (α[1], ..., α[p]) in R^p sparse.) Julien Mairal Optim for large-scale ML and sparse estimation 52/71

68 The modern parsimony and the l_1-norm Sparse linear models: machine learning/statistics point of view A few examples: Ridge regression: min_{β in R^p} (1/n) sum_{i=1}^n (1/2) (y_i - β^T x_i)^2 + λ ||β||_2^2. Linear SVM: min_{β in R^p} (1/n) sum_{i=1}^n max(0, 1 - y_i β^T x_i) + λ ||β||_2^2. Logistic regression: min_{β in R^p} (1/n) sum_{i=1}^n log(1 + e^{-y_i β^T x_i}) + λ ||β||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 53/71

69 The modern parsimony and the l_1-norm Sparse linear models: machine learning/statistics point of view A few examples: Ridge regression: min_{β in R^p} (1/n) sum_{i=1}^n (1/2) (y_i - β^T x_i)^2 + λ ||β||_2^2. Linear SVM: min_{β in R^p} (1/n) sum_{i=1}^n max(0, 1 - y_i β^T x_i) + λ ||β||_2^2. Logistic regression: min_{β in R^p} (1/n) sum_{i=1}^n log(1 + e^{-y_i β^T x_i}) + λ ||β||_2^2. The squared l_2-norm induces smoothness in β. When one knows in advance that β should be sparse, one should use a sparsity-inducing regularization such as the l_1-norm. [Chen et al., 1999, Tibshirani, 1996] Julien Mairal Optim for large-scale ML and sparse estimation 54/71
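For the ridge-regression example above, the squared l_2 penalty even yields a closed-form solution; a quick sketch of it, using the (1/n) and (1/2) scalings written on this slide (the function name is mine, and this is provided only as an illustration).

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimizer of (1/n) * sum_i 0.5*(y_i - beta^T x_i)^2 + lam * ||beta||_2^2,
    from the normal equations (X^T X + 2*n*lam*I) beta = X^T y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + 2.0 * n * lam * np.eye(p), X.T @ y)
```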

70 The modern parsimony and the l_1-norm Why does the l_1-norm induce sparsity? Can we get some intuition from the simplest isotropic case, α̂(λ) = argmin_{α in R^p} (1/2) ||x - α||_2^2 + λ ||α||_1, or equivalently the Euclidean projection onto the l_1-ball, α(µ) = argmin_{α in R^p} (1/2) ||x - α||_2^2 s.t. ||α||_1 <= µ? Equivalent means that for all λ > 0, there exists µ >= 0 such that α(µ) = α̂(λ). Julien Mairal Optim for large-scale ML and sparse estimation 55/71

71 The modern parsimony and the l_1-norm Why does the l_1-norm induce sparsity? Can we get some intuition from the simplest isotropic case, α̂(λ) = argmin_{α in R^p} (1/2) ||x - α||_2^2 + λ ||α||_1, or equivalently the Euclidean projection onto the l_1-ball, α(µ) = argmin_{α in R^p} (1/2) ||x - α||_2^2 s.t. ||α||_1 <= µ? Equivalent means that for all λ > 0, there exists µ >= 0 such that α(µ) = α̂(λ). The relation between µ and λ is unknown a priori. Julien Mairal Optim for large-scale ML and sparse estimation 55/71
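The constrained formulation requires the Euclidean projection onto the l_1-ball; here is a sketch of one standard sort-based O(p log p) algorithm for it (not described on the slides, and the function name is mine).

```python
import numpy as np

def project_l1_ball(x, mu):
    """Euclidean projection of x onto {alpha : ||alpha||_1 <= mu}."""
    if np.abs(x).sum() <= mu:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]                       # magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, x.size + 1) > cssv - mu)[0][-1]
    theta = (cssv[rho] - mu) / (rho + 1.0)             # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```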

72 Why does the l_1-norm induce sparsity? Regularizing with the l_1-norm (Figure: the l_1-ball ||α||_1 <= µ in the (α[1], α[2]) plane.) The projection onto a convex set is biased towards singularities. Julien Mairal Optim for large-scale ML and sparse estimation 56/71

73 Why does the l_1-norm induce sparsity? Regularizing with the l_2-norm (Figure: the l_2-ball ||α||_2 <= µ in the (α[1], α[2]) plane.) The l_2-norm is isotropic. Julien Mairal Optim for large-scale ML and sparse estimation 57/71

74 Why does the l 1 -norm induce sparsity? In 3D. (images produced by G. Obozinski) Julien Mairal Optim for large-scale ML and sparse estimation 58/71

75 Why does the l_1-norm induce sparsity? Regularizing with the l_∞-norm (Figure: the l_∞-ball ||α||_∞ <= µ in the (α[1], α[2]) plane.) The l_∞-norm encourages |α[1]| = |α[2]|. Julien Mairal Optim for large-scale ML and sparse estimation 59/71

76 Why does the l_1-norm induce sparsity? Analytical point of view: 1D case min_{α in R} (1/2)(x - α)^2 + λ|α|. Piecewise quadratic function with a kink at zero. Derivative at 0^+: g_+ = -x + λ and at 0^-: g_- = -x - λ. Optimality conditions. α is optimal iff: |α| > 0 and -(x - α) + λ sign(α) = 0, or α = 0 and g_+ >= 0 and g_- <= 0. The solution is a soft-thresholding: α^* = sign(x)(|x| - λ)_+. Julien Mairal Optim for large-scale ML and sparse estimation 60/71

77 Why does the l_1-norm induce sparsity? Comparison with l_2-regularization in 1D: ψ(α) = α^2 has ψ'(α) = 2α, whereas ψ(α) = |α| has one-sided derivatives ψ'_-(α) = -1 and ψ'_+(α) = 1 at zero. The gradient of the l_2-penalty vanishes when α gets close to 0. On its differentiable part, the norm of the gradient of the l_1-norm is constant. Julien Mairal Optim for large-scale ML and sparse estimation 61/71

78 Why does the l_1-norm induce sparsity? Physical illustration (Figure: two spring systems at rest, with E_1 = 0.) Julien Mairal Optim for large-scale ML and sparse estimation 62/71

79 Why does the l_1-norm induce sparsity? Physical illustration (Figure: left, two springs with energies E_1 = (k_1/2)(x - α)^2 and E_2 = (k_2/2) α^2; right, a spring and a weight with E_1 = (k_1/2)(x - α)^2 and E_2 = m g α.) Julien Mairal Optim for large-scale ML and sparse estimation 62/71

80 Why does the l_1-norm induce sparsity? Physical illustration (Figure: same setting; with the gravity term E_2 = m g α, the equilibrium is reached at α = 0!!) Julien Mairal Optim for large-scale ML and sparse estimation 62/71

81 Why does the l_1-norm induce sparsity? Figure: The regularization path of the Lasso, min_{α in R^p} (1/2) ||x - Dα||_2^2 + λ ||α||_1, showing the coefficient values w_1, ..., w_5 as functions of λ. Julien Mairal Optim for large-scale ML and sparse estimation 63/71
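A path such as the one in this figure can be traced numerically by solving the l_1-penalized problem over a decreasing grid of λ with warm starts; a sketch using the proximal gradient iteration from Part I (the figure itself may well have been produced with a dedicated homotopy/LARS solver, so this is only an illustration).

```python
import numpy as np

def lasso_path(D, x, lambdas, n_iters=500):
    """Approximate Lasso path: for each lam (largest first), run proximal
    gradient on 0.5*||x - D a||^2 + lam*||a||_1, warm-starting from the
    previous solution, and record the coefficients."""
    L = np.linalg.eigvalsh(D.T @ D).max()
    alpha = np.zeros(D.shape[1])
    path = []
    for lam in sorted(lambdas, reverse=True):
        for _ in range(n_iters):
            z = alpha - D.T @ (D @ alpha - x) / L
            alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        path.append(alpha.copy())
    return np.array(path)    # one row of coefficients per value of lam
```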

82 Non-convex sparsity-inducing penalties Exploiting concave functions with a kink at zero: ψ(α) = sum_{j=1}^p ϕ(|α[j]|). l_q-penalty, with 0 < q < 1: ψ(α) = sum_{j=1}^p |α[j]|^q, [Frank and Friedman, 1993]; log penalty, ψ(α) = sum_{j=1}^p log(|α[j]| + ε). ϕ is any function that looks like this: (Figure: a concave function ϕ(|α|) with a kink at zero.) Julien Mairal Optim for large-scale ML and sparse estimation 64/71

83 Non-convex sparsity-inducing penalties (a) l 0.5-ball, 2-D (b) l 1-ball, 2-D (c) l 2-ball, 2-D Figure: Open balls in 2-D corresponding to several l q -norms and pseudo-norms. Julien Mairal Optim for large-scale ML and sparse estimation 65/71

84 Non-convex sparsity-inducing penalties (Figure: the l_q-ball ||α||_q <= µ with q < 1, in the (α[1], α[2]) plane.) Julien Mairal Optim for large-scale ML and sparse estimation 66/71

85 Elastic-net The elastic net introduced by [Zou and Hastie, 2005]: ψ(α) = ||α||_1 + γ ||α||_2^2. The penalty provides more stable (but less sparse) solutions. (a) l_1-ball, 2-D (b) elastic-net, 2-D (c) l_2-ball, 2-D Julien Mairal Optim for large-scale ML and sparse estimation 67/71
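Assuming the penalty enters the objective as λ (||α||_1 + γ ||α||_2^2), its proximal operator has a simple closed form: soft-thresholding followed by a shrinkage. A small sketch under that assumption (function name is mine):

```python
import numpy as np

def prox_elastic_net(y, lam, gamma):
    """Prox of lam * (||.||_1 + gamma * ||.||_2^2): entrywise soft-threshold
    at lam, then shrink by 1 / (1 + 2 * lam * gamma)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0) / (1.0 + 2.0 * lam * gamma)
```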

86 The elastic-net vs other penalties (Figure: the l_q-ball ||α||_q <= µ with q < 1.) Julien Mairal Optim for large-scale ML and sparse estimation 68/71

87 The elastic-net vs other penalties (Figure: the l_1-ball ||α||_1 <= µ.) Julien Mairal Optim for large-scale ML and sparse estimation 68/71

88 The elastic-net vs other penalties (Figure: the elastic-net ball (1 - γ) ||α||_1 + γ ||α||_2^2 <= µ.) Julien Mairal Optim for large-scale ML and sparse estimation 68/71

89 The elastic-net vs other penalties (Figure: the l_2-ball ||α||_2^2 <= µ.) Julien Mairal Optim for large-scale ML and sparse estimation 68/71

90 Structured sparsity images produced by G. Obozinski Julien Mairal Optim for large-scale ML and sparse estimation 69/71

91 Structured sparsity images produced by G. Obozinski Julien Mairal Optim for large-scale ML and sparse estimation 70/71

92 Mark the date! July 2-6th, Grenoble Along with Naver Labs, Inria is organizing a summer school in Grenoble on artificial intelligence. Visit Among the distinguished speakers Lourdes Agapito (UCL) Kyunghyun Cho (NYU/Facebook) Emmanuel Dupoux (EHESS) Martial Hebert (CMU) Hugo Larochelle (Google Brain) Yann LeCun (Facebook/NYU) Jean Ponce (Inria) Cordelia Schmid (Inria) Andrew Zisserman (Oxford/Google DeepMind).... Julien Mairal Optim for large-scale ML and sparse estimation 71/71

93 References I A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proceedings of the International Conference on Machine Learning (ICML), H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages , Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. arxiv preprint arxiv: , A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1): , Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14 (1): , Julien Mairal Optim for large-scale ML and sparse estimation 72/71

94 References II Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arxiv preprint arxiv: , E. J. Candès and D. L. Donoho. Recovering edges in ill-posed inverse problems: Optimality of curvelet frames. Annals of Statistics, 30(3): , Antonin Chambolle and Jérôme Darbon. On total variation minimization and surface evolution using parametric maximum flows. International journal of computer vision, 84(3):288, S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33 61, P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM Multiscale Modeling and Simulation, 4(4): , David Corfield, Bernhard Schölkopf, and Vladimir Vapnik. Falsificationism and statistical learning theory: Comparing the popper and vapnik-chervonenkis dimensions. Journal for General Philosophy of Science, 40(1):51 58, I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11): , Julien Mairal Optim for large-scale ML and sparse estimation 73/71

95 References III J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2(7): , A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a. A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conference on Machine Learning (ICML), 2014b. M. Do and M. Vertterli. Contourlets, Beyond Wavelets. Academic Press, D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3): , M. A. Efroymson. Multiple regression analysis. Mathematical methods for digital computers, 9(1): , I. E Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2): , Julien Mairal Optim for large-scale ML and sparse estimation 74/71

96 References IV G. M. Furnival and R. W. Wilson. Regressions by leaps and bounds. Technometrics, 16(4): , R. R. Hocking. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32:1 49, H. Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics, 19(4): , R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12: , Guanghui Lan. An optimal randomized incremental gradient method. arxiv preprint arxiv: , E. Le Pennec and S. Mallat. Sparse geometric image representations with bandelets. IEEE Transactions on Image Processing, 14(4): , H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, Julien Mairal Optim for large-scale ML and sparse estimation 75/71

97 References V J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2): , J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems (NIPS), C. L. Mallows. Choosing variables in a linear regression: A graphical aid. unpublished paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas, C. L. Mallows. Choosing a subset regression. unpublished paper presented at the Joint Statistical Meeting, Los Angeles, California, B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24: , Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1): , Julien Mairal Optim for large-scale ML and sparse estimation 76/71

98 References VI Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence o (1/k2). In Doklady an SSSR, volume 269, pages , R. D. Nowak and M. A. T. Figueiredo. Fast wavelet-based image deconvolution using the EM algorithm. In Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers., B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381: , J. Rissanen. Modeling by shortest data description. Automatica, 14(5): , M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arxiv: , Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arxiv preprint arxiv: , Julien Mairal Optim for large-scale ML and sparse estimation 77/71

99 References VII G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2): , S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arxiv: , S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1 41, E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2): , R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1): , Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7): , Julien Mairal Optim for large-scale ML and sparse estimation 78/71

100 References VIII D. Wrinch and H. Jeffreys. XLII. On certain fundamental principles of scientific inquiry. Philosophical Magazine Series 6, 42(249): , L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4): , Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the International Conference on Machine Learning (ICML), H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67(2): , Julien Mairal Optim for large-scale ML and sparse estimation 79/71


More information

Structured sparsity-inducing norms through submodular functions

Structured sparsity-inducing norms through submodular functions Structured sparsity-inducing norms through submodular functions Francis Bach Sierra team, INRIA - Ecole Normale Supérieure - CNRS Thanks to R. Jenatton, J. Mairal, G. Obozinski June 2011 Outline Introduction:

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Parcimonie en apprentissage statistique

Parcimonie en apprentissage statistique Parcimonie en apprentissage statistique Guillaume Obozinski Ecole des Ponts - ParisTech Journée Parcimonie Fédération Charles Hermite, 23 Juin 2014 Parcimonie en apprentissage 1/44 Classical supervised

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

More information

An Homotopy Algorithm for the Lasso with Online Observations

An Homotopy Algorithm for the Lasso with Online Observations An Homotopy Algorithm for the Lasso with Online Observations Pierre J. Garrigues Department of EECS Redwood Center for Theoretical Neuroscience University of California Berkeley, CA 94720 garrigue@eecs.berkeley.edu

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Convex relaxation for Combinatorial Penalties

Convex relaxation for Combinatorial Penalties Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,

More information

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 08: Sparsity Based Regularization. Lorenzo Rosasco

MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 08: Sparsity Based Regularization. Lorenzo Rosasco MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications Class 08: Sparsity Based Regularization Lorenzo Rosasco Learning algorithms so far ERM + explicit l 2 penalty 1 min w R d n n l(y

More information

A tutorial on sparse modeling. Outline:

A tutorial on sparse modeling. Outline: A tutorial on sparse modeling. Outline: 1. Why? 2. What? 3. How. 4. no really, why? Sparse modeling is a component in many state of the art signal processing and machine learning tasks. image processing

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Learning MMSE Optimal Thresholds for FISTA

Learning MMSE Optimal Thresholds for FISTA MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Learning MMSE Optimal Thresholds for FISTA Kamilov, U.; Mansour, H. TR2016-111 August 2016 Abstract Fast iterative shrinkage/thresholding algorithm

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

Stochastic Optimization: First order method

Stochastic Optimization: First order method Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO

More information

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

... SPARROW. SPARse approximation Weighted regression. Pardis Noorzad. Department of Computer Engineering and IT Amirkabir University of Technology

... SPARROW. SPARse approximation Weighted regression. Pardis Noorzad. Department of Computer Engineering and IT Amirkabir University of Technology ..... SPARROW SPARse approximation Weighted regression Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Université de Montréal March 12, 2012 SPARROW 1/47 .....

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

SPARSE SIGNAL RESTORATION. 1. Introduction

SPARSE SIGNAL RESTORATION. 1. Introduction SPARSE SIGNAL RESTORATION IVAN W. SELESNICK 1. Introduction These notes describe an approach for the restoration of degraded signals using sparsity. This approach, which has become quite popular, is useful

More information

Structured Sparse Estimation with Network Flow Optimization

Structured Sparse Estimation with Network Flow Optimization Structured Sparse Estimation with Network Flow Optimization Julien Mairal University of California, Berkeley Neyman seminar, Berkeley Julien Mairal Neyman seminar, UC Berkeley /48 Purpose of the talk introduce

More information