Optimization methods for large-scale machine learning and sparse estimation. Julien Mairal


1 Optimization methods for large-scale machine learning and sparse estimation Julien Mairal Inria Grenoble Nantes, Mascot-Num, 2018 Part II Julien Mairal Optim for large-scale ML and sparse estimation 1/71

2 Common paradigm: optimization for machine learning Optimization is central to machine learning. For instance, in supervised learning, the goal is to learn a prediction function f : X -> Y given labeled training data (x_i, y_i)_{i=1,...,n} with x_i in X, and y_i in Y: min_{f in F} (1/n) sum_{i=1}^n L(y_i, f(x_i)) [empirical risk, data fit] + λ Ω(f) [regularization]. [Vapnik, 1995, Bottou, Curtis, and Nocedal, 2016]... Julien Mairal Optim for large-scale ML and sparse estimation 2/71

3 Paradigm 3: The sparsity principle Let us consider again the classical scientific paradigm: 1 observe the world (gather data); 2 propose models of the world (design and learn); 3 test on new data (estimate the generalization error). [Corfield et al., 2009]. Julien Mairal Optim for large-scale ML and sparse estimation 3/71

4 Paradigm 3: The sparsity principle Let us consider again the classical scientific paradigm: 1 observe the world (gather data); 2 propose models of the world (design and learn); 3 test on new data (estimate the generalization error). But... it is not always possible to distinguish the generalization error of various models based on available data. when a complex model A performs slightly better than a simple model B, should we prefer A or B? generalization error requires a predictive task: what about unsupervised learning? which measure should we use? we are also leaving aside the problem of non i.i.d. train/test data, biased data, testing with counterfactual reasoning... [Corfield et al., 2009, Bottou et al., 2013, Schölkopf et al., 2012]. Julien Mairal Optim for large-scale ML and sparse estimation 3/71

5 Paradigm 3: The sparsity principle (a) Dorothy Wrinch (b) Harold Jeffreys The existence of simple laws is, then, apparently, to be regarded as a quality of nature; and accordingly we may infer that it is justifiable to prefer a simple law to a more complex one that fits our observations slightly better. [Wrinch and Jeffreys, 1921]. Julien Mairal Optim for large-scale ML and sparse estimation 4/71

6 Paradigm 3: The sparsity principle Remarks: sparsity is... appealing for experimental sciences for model interpretation; (too-)well understood in some mathematical contexts: min_{w in R^p} (1/n) sum_{i=1}^n L(y_i, w^T x_i) [empirical risk, data fit] + λ ||w||_1 [regularization]; extremely powerful for unsupervised learning in the context of matrix factorization, and simple to use. [Olshausen and Field, 1996, Chen, Donoho, and Saunders, 1999, Tibshirani, 1996]... Julien Mairal Optim for large-scale ML and sparse estimation 5/71

7 Paradigm 3: The sparsity principle Remarks: sparsity is... appealing for experimental sciences for model interpretation; (too-)well understood in some mathematical contexts: min_{w in R^p} (1/n) sum_{i=1}^n L(y_i, w^T x_i) [empirical risk, data fit] + λ ||w||_1 [regularization]; extremely powerful for unsupervised learning in the context of matrix factorization, and simple to use. Today's challenges Develop sparse and stable (and invariant?) models. Go beyond clustering / low-rank / union of subspaces. [Olshausen and Field, 1996, Chen, Donoho, and Saunders, 1999, Tibshirani, 1996]... Julien Mairal Optim for large-scale ML and sparse estimation 5/71

8 Some references On kernel methods B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond J. Shawe-Taylor and N. Cristianini. An introduction to support vector machines and other kernel-based learning methods slides: On sparse estimation M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing J. Mairal, F. Bach, and J. Ponce. Sparse Modeling for Image and Vision Processing free online. Julien Mairal Optim for large-scale ML and sparse estimation 6/71

9 Some references On large-scale optimization L. Bottou, F. E. Curtis and J. Nocedal. Optimization methods for large-scale machine learning, preprint arxiv: , Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning slides by F. Bach: Julien Mairal Optim for large-scale ML and sparse estimation 7/71

10 Material on sparse estimation (freely available on arxiv) J. Mairal, F. Bach and J. Ponce. Sparse Modeling for Image and Vision Processing. Foundations and Trends in Computer Graphics and Vision F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1) Julien Mairal Optim for large-scale ML and sparse estimation 8/71

11 Part I: Large-scale optimization for machine learning Julien Mairal Optim for large-scale ML and sparse estimation 9/71

12 Focus of this part Minimizing large finite sums Consider the minimization of a large sum of convex functions min_{x in R^d} { f(x) = (1/n) sum_{i=1}^n f_i(x) + ψ(x) }, where each f_i is L-smooth and convex and ψ is a convex regularization penalty but not necessarily differentiable. Julien Mairal Optim for large-scale ML and sparse estimation 10/71

13 Focus of this part Minimizing large finite sums Consider the minimization of a large sum of convex functions min_{x in R^d} { f(x) = (1/n) sum_{i=1}^n f_i(x) + ψ(x) }, where each f_i is L-smooth and convex and ψ is a convex regularization penalty but not necessarily differentiable. Why this setting? convexity makes it easy to obtain complexity bounds. convex optimization is often effective for non-convex problems. What we will not cover performance of approaches in terms of test error. Julien Mairal Optim for large-scale ML and sparse estimation 10/71

14 Introduction of a few optimization principles Convex Functions Why do we care about convexity? (Figure: a convex function f(x).) Julien Mairal Optim for large-scale ML and sparse estimation 11/71

15 Introduction of a few optimization principles Convex Functions Local observations give information about the global optimum. (Figure: a convex function f(x).) ∇f(x^*) = 0 is a necessary and sufficient optimality condition for differentiable convex functions; it is often easy to upper-bound f(x) - f^*. Julien Mairal Optim for large-scale ML and sparse estimation 11/71

16 Introduction of a few optimization principles An important inequality for smooth convex functions If f is convex and smooth: f(x) >= f(x_0) + ∇f(x_0)^T (x - x_0) [linear approximation]; if f is non-smooth, a similar inequality holds for subgradients. Julien Mairal Optim for large-scale ML and sparse estimation 12/71

17 Introduction of a few optimization principles An important inequality for smooth functions If ∇f is L-Lipschitz continuous (f does not need to be convex): f(x) <= g(x) = f(x_0) + ∇f(x_0)^T (x - x_0) [linear approximation] + (L/2) ||x - x_0||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 13/71

18 Introduction of a few optimization principles An important inequality for smooth functions If ∇f is L-Lipschitz continuous (f does not need to be convex): f(x) <= g(x) = f(x_0) + ∇f(x_0)^T (x - x_0) [linear approximation] + (L/2) ||x - x_0||_2^2; the surrogate can be rewritten g(x) = C_{x_0} + (L/2) ||x_0 - (1/L) ∇f(x_0) - x||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 13/71

19 Introduction of a few optimization principles An important inequality for smooth functions If ∇f is L-Lipschitz continuous (f does not need to be convex): f(x) <= g(x) = f(x_0) + ∇f(x_0)^T (x - x_0) [linear approximation] + (L/2) ||x - x_0||_2^2; the surrogate is minimized by x_1 = x_0 - (1/L) ∇f(x_0) (gradient descent step). Julien Mairal Optim for large-scale ML and sparse estimation 13/71

20 Introduction of a few optimization principles Gradient Descent Algorithm Assume that f is convex and L-smooth (∇f is L-Lipschitz). Theorem Consider the algorithm x_t <- x_{t-1} - (1/L) ∇f(x_{t-1}). Then, f(x_t) - f^* <= L ||x_0 - x^*||_2^2 / (2t). Julien Mairal Optim for large-scale ML and sparse estimation 14/71
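To make the iteration concrete, here is a minimal NumPy sketch of the step x_t <- x_{t-1} - (1/L) ∇f(x_{t-1}); the least-squares objective, variable names, and iteration count are illustrative choices and not taken from the slides.

```python
import numpy as np

def gradient_descent(grad_f, x0, L, n_iters=100):
    """Iterate x_t <- x_{t-1} - (1/L) * grad_f(x_{t-1})."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - grad_f(x) / L
    return x

# Toy example: f(x) = 0.5 * ||A x - b||^2, whose gradient A^T (A x - b)
# is L-Lipschitz with L = largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
L = np.linalg.eigvalsh(A.T @ A).max()
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(10), L)
```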

21 Proof (1/2) Proof of the main inequality for smooth functions We want to show that for all x and z, f(x) <= f(z) + ∇f(z)^T (x - z) + (L/2) ||x - z||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 15/71

22 Proof (1/2) Proof of the main inequality for smooth functions We want to show that for all x and z, f(x) <= f(z) + ∇f(z)^T (x - z) + (L/2) ||x - z||_2^2. By using Taylor's theorem with integral form, f(x) - f(z) = ∫_0^1 ∇f(tx + (1-t)z)^T (x - z) dt. Then, f(x) - f(z) - ∇f(z)^T (x - z) = ∫_0^1 (∇f(tx + (1-t)z) - ∇f(z))^T (x - z) dt <= ∫_0^1 |(∇f(tx + (1-t)z) - ∇f(z))^T (x - z)| dt <= ∫_0^1 ||∇f(tx + (1-t)z) - ∇f(z)||_2 ||x - z||_2 dt (C.-S.) <= ∫_0^1 L t ||x - z||_2^2 dt = (L/2) ||x - z||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 15/71

23 Proof (2/2) Proof of the theorem We have shown that for all x, f(x) <= g_t(x) = f(x_{t-1}) + ∇f(x_{t-1})^T (x - x_{t-1}) + (L/2) ||x - x_{t-1}||_2^2. g_t is minimized by x_t; it can be rewritten g_t(x) = g_t(x_t) + (L/2) ||x - x_t||_2^2. Then, f(x_t) <= g_t(x_t) = g_t(x^*) - (L/2) ||x^* - x_t||_2^2 = f(x_{t-1}) + ∇f(x_{t-1})^T (x^* - x_{t-1}) + (L/2) ||x^* - x_{t-1}||_2^2 - (L/2) ||x^* - x_t||_2^2 <= f^* + (L/2) ||x^* - x_{t-1}||_2^2 - (L/2) ||x^* - x_t||_2^2. By summing from t = 1 to T, we have a telescopic sum: T (f(x_T) - f^*) <= sum_{t=1}^T (f(x_t) - f^*) <= (L/2) ||x^* - x_0||_2^2 - (L/2) ||x^* - x_T||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 16/71

24 Introduction of a few optimization principles An important inequality for smooth and µ-strongly convex functions If ∇f is L-Lipschitz continuous and f is µ-strongly convex: f(x) <= f(x_0) + ∇f(x_0)^T (x - x_0) + (L/2) ||x - x_0||_2^2; f(x) >= f(x_0) + ∇f(x_0)^T (x - x_0) + (µ/2) ||x - x_0||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 17/71

25 Introduction of a few optimization principles Proposition When f is µ-strongly convex and L-smooth, the gradient descent algorithm with step-size 1/L produces iterates such that f(x_t) - f^* <= (1 - µ/L)^t (L/2) ||x_0 - x^*||_2^2. We call that a linear convergence rate. Remarks if f is twice differentiable, L and µ represent the largest and smallest eigenvalues of the Hessian, respectively. L/µ is called the condition number. Julien Mairal Optim for large-scale ML and sparse estimation 18/71

26 Proof We start from an inequality from the previous proof: f(x_t) <= f(x_{t-1}) + ∇f(x_{t-1})^T (x^* - x_{t-1}) + (L/2) ||x^* - x_{t-1}||_2^2 - (L/2) ||x^* - x_t||_2^2 <= f^* + ((L - µ)/2) ||x^* - x_{t-1}||_2^2 - (L/2) ||x^* - x_t||_2^2. In addition, we have that f(x_t) >= f^* + (µ/2) ||x_t - x^*||_2^2, and thus ||x^* - x_t||_2^2 <= ((L - µ)/(L + µ)) ||x^* - x_{t-1}||_2^2 <= (1 - µ/L) ||x^* - x_{t-1}||_2^2. Finally, f(x_t) - f^* <= (L/2) ||x_t - x^*||_2^2 <= (1 - µ/L)^t (L/2) ||x^* - x_0||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 19/71

27 Introduction of a few optimization principles Remark: with stepsize 1/L, gradient descent may be interpreted as a majorization-minimization algorithm. Figure: the surrogate g_y majorizes f and is tight at y = x_old; at each step, we update x_new <- argmin_{x in R^p} g_y(x). Julien Mairal Optim for large-scale ML and sparse estimation 20/71

28 The proximal gradient method An important inequality for composite functions If ∇f_0 is L-Lipschitz continuous, then for f(x) = f_0(x) + ψ(x): f_0(x) + ψ(x) <= g(x) = f_0(x_0) + ∇f_0(x_0)^T (x - x_0) + (L/2) ||x - x_0||_2^2 + ψ(x); x_1 minimizes g. Julien Mairal Optim for large-scale ML and sparse estimation 21/71

29 The proximal gradient method Gradient descent for minimizing f consists of x_t <- argmin_{x in R^p} g_t(x), i.e., x_t <- x_{t-1} - (1/L) ∇f(x_{t-1}). The proximal gradient method for minimizing f = f_0 + ψ consists of x_t <- argmin_{x in R^p} g_t(x), which is equivalent to x_t <- argmin_{x in R^p} (1/2) ||x_{t-1} - (1/L) ∇f_0(x_{t-1}) - x||_2^2 + (1/L) ψ(x). It requires computing efficiently the proximal operator of ψ: y -> argmin_{x in R^p} (1/2) ||y - x||_2^2 + ψ(x). Julien Mairal Optim for large-scale ML and sparse estimation 22/71

30 The proximal gradient method Remarks also known as forward-backward algorithm; has similar convergence rates as the gradient descent method (the proof is nearly identical). there exist line search schemes to automatically tune L; The case of l_1 The proximal operator of λ||.||_1 is the soft-thresholding operator x[j] = sign(y[j]) (|y[j]| - λ)_+. The resulting algorithm is called iterative soft-thresholding. [Nowak and Figueiredo, 2001, Daubechies et al., 2004, Combettes and Wajs, 2006, Beck and Teboulle, 2009, Wright et al., 2009, Nesterov, 2013]... Julien Mairal Optim for large-scale ML and sparse estimation 23/71
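A minimal sketch of the resulting iterative soft-thresholding (proximal gradient) algorithm for the l_1-regularized least-squares problem (1/2)||A x - b||_2^2 + λ||x||_1; the function names and iteration budget are illustrative, not taken from the slides.

```python
import numpy as np

def soft_threshold(y, lam):
    """Proximal operator of lam * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def ista(A, b, lam, n_iters=300):
    """Proximal gradient method for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.eigvalsh(A.T @ A).max()       # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                # gradient of the smooth part
        x = soft_threshold(x - grad / L, lam / L)   # prox of (lam/L) * ||.||_1
    return x
```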

31 The proximal gradient method The proximal operator for the group Lasso penalty min_{x in R^p} (1/2) ||y - x||_2^2 + λ sum_{g in G} ||x[g]||_q. For q = 2, x[g] = (y[g] / ||y[g]||_2) (||y[g]||_2 - λ)_+ for all g in G. For q = ∞, x[g] = y[g] - Π_{||.||_1 <= λ}[y[g]] for all g in G. These formulas generalize soft-thresholding to groups of variables. Julien Mairal Optim for large-scale ML and sparse estimation 24/71
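For q = 2 the formula above is a block soft-thresholding; a small sketch, where the encoding of the (non-overlapping) groups as a list of index arrays is my own convention:

```python
import numpy as np

def prox_group_lasso(y, groups, lam):
    """Prox of lam * sum_g ||x[g]||_2 for non-overlapping groups:
    each block y[g] is scaled by (1 - lam / ||y[g]||_2)_+."""
    x = np.zeros_like(y)
    for g in groups:                          # each g is an array of indices
        norm_g = np.linalg.norm(y[g])
        if norm_g > lam:
            x[g] = (1.0 - lam / norm_g) * y[g]
    return x

# Example with two groups of a vector in R^5.
y = np.array([0.3, -0.1, 2.0, -1.5, 0.4])
print(prox_group_lasso(y, [np.array([0, 1]), np.array([2, 3, 4])], lam=0.5))
```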

32 The proximal gradient method A few proximal operators: l 0 -penalty: hard-thresholding; l 1 -norm: soft-thresholding; group-lasso: group soft-thresholding; fused-lasso (1D total variation): [Hoefling, 2010]; total variation: [Chambolle and Darbon, 2009]; hierarchical norms: [Jenatton et al., 2011], O(p) complexity; overlapping group Lasso with l -norm: [Mairal et al., 2010]; Julien Mairal Optim for large-scale ML and sparse estimation 25/71

33 Accelerated gradient descent methods Nesterov introduced in the 80's an acceleration scheme for the gradient descent algorithm. It was generalized later to the composite setting. FISTA x_t <- argmin_{x in R^p} (1/2) ||y_{t-1} - (1/L) ∇f_0(y_{t-1}) - x||_2^2 + (1/L) ψ(x); Find α_t > 0 s.t. α_t^2 = (1 - α_t) α_{t-1}^2 + (µ/L) α_t; y_t <- x_t + β_t (x_t - x_{t-1}) with β_t = α_{t-1}(1 - α_{t-1}) / (α_{t-1}^2 + α_t). f(x_t) - f^* = O(1/t^2) for convex problems; f(x_t) - f^* = O((1 - sqrt(µ/L))^t) for µ-strongly convex problems; Acceleration works in many practical cases. see [Beck and Teboulle, 2009, Nesterov, 1983, 2004, 2013] Julien Mairal Optim for large-scale ML and sparse estimation 26/71
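A minimal sketch of the accelerated proximal gradient method, written here with the classical momentum sequence of Beck and Teboulle for the convex case (µ = 0), which differs in form from the α_t / β_t recursion above but plays the same role; names and the Lasso example are illustrative.

```python
import numpy as np

def fista(grad_f0, prox_psi, x0, L, n_iters=300):
    """Accelerated proximal gradient (convex case): proximal step at an
    extrapolated point y, then momentum update of y."""
    x_prev, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iters):
        x = prox_psi(y - grad_f0(y) / L, 1.0 / L)        # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)      # extrapolation step
        x_prev, t = x, t_next
    return x_prev

# Lasso example: psi = lam * ||.||_1, whose prox is soft-thresholding.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((50, 20)), rng.standard_normal(50), 0.1
L = np.linalg.eigvalsh(A.T @ A).max()
prox_l1 = lambda z, step: np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
x_hat = fista(lambda x: A.T @ (A @ x - b), prox_l1, np.zeros(20), L)
```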

34 What do we mean by acceleration? Complexity analysis for large finite sums Since f is a sum of n functions, computing ∇f requires computing n gradients ∇f_i. The complexity to reach an ε-solution is given below. ISTA: O(n (L/µ) log(1/ε)) for µ > 0, O(n L/ε) for µ = 0. FISTA: O(n sqrt(L/µ) log(1/ε)) for µ > 0, O(n sqrt(L/ε)) for µ = 0. Remarks ε-solution means here f(x_t) - f^* <= ε. For n = 1, the rates of FISTA are optimal for a first-order local black box [Nesterov, 2004]. For L = 1 and µ = 1/n, the complexity scales at best in n^{3/2}. Julien Mairal Optim for large-scale ML and sparse estimation 27/71

35 How does acceleration work? Unfortunately, the literature does not provide any simple geometric explanation... Julien Mairal Optim for large-scale ML and sparse estimation 28/71

36 How does acceleration work? Unfortunately, the literature does not provide any simple geometric explanation... but there are a few obvious facts and a mechanism introduced by Nesterov, called estimate sequence. Obvious fact Simple gradient descent steps are blind to the past iterates, and are based on a purely local model of the objective. Accelerated methods usually involve an extrapolation step y_t = x_t + β_t (x_t - x_{t-1}) with β_t in (0,1). Nesterov interprets acceleration as relying on a better model of the objective called estimate sequence. Julien Mairal Optim for large-scale ML and sparse estimation 28/71

37 How does acceleration work? Definition of estimate sequence [Nesterov]. A pair of sequences (ϕ_t)_{t>=0} and (λ_t)_{t>=0}, with λ_t >= 0 and ϕ_t : R^p -> R, is called an estimate sequence of function f if λ_t -> 0 and for any x in R^p and all t >= 0, ϕ_t(x) - f(x) <= λ_t (ϕ_0(x) - f(x)). In addition, if for some sequence (x_t)_{t>=0} we have f(x_t) <= ϕ_t^* = min_{x in R^p} ϕ_t(x), then f(x_t) - f^* <= λ_t (ϕ_0(x^*) - f^*), where x^* is a minimizer of f. Julien Mairal Optim for large-scale ML and sparse estimation 29/71

38 How does acceleration work? In summary, we need two properties 1 ϕ_t(x) <= (1 - λ_t) f(x) + λ_t ϕ_0(x); 2 f(x_t) <= ϕ_t^* = min_{x in R^p} ϕ_t(x). Remarks ϕ_t is neither an upper-bound, nor a lower-bound; Finding the right estimate sequence is often nontrivial. Julien Mairal Optim for large-scale ML and sparse estimation 30/71

39 How does acceleration work? In summary, we need two properties 1 ϕ_t(x) <= (1 - λ_t) f(x) + λ_t ϕ_0(x); 2 f(x_t) <= ϕ_t^* = min_{x in R^p} ϕ_t(x). How to build an estimate sequence? Define ϕ_t recursively: ϕ_t(x) = (1 - α_t) ϕ_{t-1}(x) + α_t d_t(x), where d_t is a lower-bound, e.g., if F is smooth, d_t(x) = F(y_t) + ∇F(y_t)^T (x - y_t) + (µ/2) ||x - y_t||_2^2. Then, work hard to choose α_t as large as possible, and y_t and x_t such that property 2 holds. Subsequently, λ_t = prod_{k=1}^t (1 - α_k). Julien Mairal Optim for large-scale ML and sparse estimation 30/71

40 The stochastic (sub)gradient descent algorithm Consider now the minimization of an expectation min_{x in R^p} f(x) = E_z[l(x,z)]. To simplify, we assume that for all z, x -> l(x,z) is differentiable. Algorithm At iteration t, Randomly draw one example z_t from the training set; Update the current iterate x_t <- x_{t-1} - η_t ∇f_t(x_{t-1}) with f_t(x) = l(x, z_t). Perform online averaging of the iterates (optional) x̄_t <- (1 - γ_t) x̄_{t-1} + γ_t x_t. Julien Mairal Optim for large-scale ML and sparse estimation 31/71

41 The stochastic (sub)gradient descent algorithm There are various learning-rate strategies (constant, varying step-sizes), and averaging strategies. Depending on the problem assumptions and choice of η_t, γ_t, classical convergence rates may be obtained: f(x̄_t) - f^* = O(1/sqrt(t)) for convex problems; f(x̄_t) - f^* = O(1/t) for strongly-convex ones; Remarks The convergence rates are not great, but the complexity per-iteration is small (1 gradient evaluation for minimizing an empirical risk versus n for the batch algorithm). When the amount of data is infinite, the method minimizes the expected risk (which is what we want). Choosing a good learning rate automatically is an open problem. Julien Mairal Optim for large-scale ML and sparse estimation 32/71
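A minimal sketch of the stochastic gradient algorithm of the previous slides, with uniform iterate averaging (γ_t = 1/t); the interface and the suggested step-size are illustrative choices.

```python
import numpy as np

def sgd(grad_fi, n, x0, step, n_iters=10000):
    """x_t <- x_{t-1} - eta_t * (sub)gradient of a randomly drawn example,
    with Polyak-Ruppert averaging of the iterates."""
    rng = np.random.default_rng(0)
    x, x_avg = x0.copy(), x0.copy()
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                  # draw one example at random
        x = x - step(t) * grad_fi(x, i)      # stochastic (sub)gradient step
        x_avg += (x - x_avg) / t             # online average, gamma_t = 1/t
    return x_avg

# Usage sketch: grad_fi(x, i) returns a (sub)gradient of the i-th loss at x,
# e.g. with a decreasing step-size such as step = lambda t: 1.0 / np.sqrt(t).
```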

42 Proof of an O(1/sqrt(t)) rate for the convex case Inspired by (aka, stolen from) F. Bach's slides Assumptions The solution lies in a bounded domain C = {||x|| <= D}. The sub-gradients are bounded on C: ||∇f_t(x)|| <= B. Fix in advance the number of iterations T and choose η_t = 2D / (B sqrt(T)). Choose Polyak-Ruppert averaging x̄_T = (1/T) sum_{t=0}^{T-1} x_t. Perform updates with projections x_t <- Π_C[x_{t-1} - η_t ∇f_t(x_{t-1})]. Proposition E[f(x̄_T) - f^*] <= 2DB / sqrt(T). Julien Mairal Optim for large-scale ML and sparse estimation 33/71

43 Proof of an O(1/sqrt(t)) rate for the convex case Inspired by (aka, stolen from) F. Bach's slides F_t: information up to time t. ||x|| <= D and ||∇f_t(x)|| <= B. Besides, E[∇f_t(x) | F_{t-1}] = ∇f(x). ||x_t - x^*||^2 <= ||x_{t-1} - η_t ∇f_t(x_{t-1}) - x^*||^2 <= ||x_{t-1} - x^*||^2 + B^2 η_t^2 - 2η_t (x_{t-1} - x^*)^T ∇f_t(x_{t-1}). Take now conditional expectations: E[||x_t - x^*||^2 | F_{t-1}] <= ||x_{t-1} - x^*||^2 + B^2 η_t^2 - 2η_t (x_{t-1} - x^*)^T ∇f(x_{t-1}) <= ||x_{t-1} - x^*||^2 + B^2 η_t^2 - 2η_t (f(x_{t-1}) - f^*). Take now full expectations: E[||x_t - x^*||^2] <= E[||x_{t-1} - x^*||^2] + B^2 η_t^2 - 2η_t E[f(x_{t-1}) - f^*], and, after reorganizing the terms, E[f(x_{t-1}) - f^*] <= B^2 η_t / 2 + (1/(2η_t)) (E[||x_{t-1} - x^*||^2] - E[||x_t - x^*||^2]). Julien Mairal Optim for large-scale ML and sparse estimation 34/71

44 Proof of an O(1/sqrt(t)) rate for the convex case Inspired by (aka, stolen from) F. Bach's slides We start again from E[f(x_{t-1}) - f^*] <= B^2 η_t / 2 + (1/(2η_t)) (E[||x_{t-1} - x^*||^2] - E[||x_t - x^*||^2]), and we exploit the telescopic sum: sum_{t=1}^T E[f(x_{t-1}) - f^*] <= sum_{t=1}^T B^2 η_t / 2 + sum_{t=1}^T (1/(2η_t)) (E[||x_{t-1} - x^*||^2] - E[||x_t - x^*||^2]) <= T B^2 η / 2 + 4D^2 / (2η) <= 2DB sqrt(T) with η = 2D / (B sqrt(T)). Finally, we conclude by using a convexity inequality: E[f((1/T) sum_{t=0}^{T-1} x_t)] - f^* <= 2DB / sqrt(T). Julien Mairal Optim for large-scale ML and sparse estimation 35/71

45 Back to finite sums Consider now the case of interest for us today: min_{x in R^p} (1/n) sum_{i=1}^n f_i(x). Question Can we do as well as SGD in terms of cost per iteration, while enjoying a fast (linear) convergence rate like (accelerated or not) gradient descent? For n = 1, no! The rates are optimal for a first-order local black box [Nesterov, 2004]. For n >> 1, yes! We need to design algorithms whose per-iteration computational complexity is smaller than n; whose convergence rate may be worse than FISTA... ...but with a better expected computational complexity. Julien Mairal Optim for large-scale ML and sparse estimation 36/71

46 Incremental gradient descent methods min_{x in R^p} { f(x) = (1/n) sum_{i=1}^n f_i(x) }. Several randomized algorithms are designed with one ∇f_i computed per iteration, with fast convergence rates, e.g., SAG [Schmidt et al., 2013]: x_k <- x_{k-1} - (γ/(Ln)) sum_{i=1}^n y_i^k with y_i^k = ∇f_i(x_{k-1}) if i = i_k, y_i^{k-1} otherwise. Julien Mairal Optim for large-scale ML and sparse estimation 37/71

47 Incremental gradient descent methods min_{x in R^p} { f(x) = (1/n) sum_{i=1}^n f_i(x) }. Several randomized algorithms are designed with one ∇f_i computed per iteration, with fast convergence rates, e.g., SAG [Schmidt et al., 2013]: x_k <- x_{k-1} - (γ/(Ln)) sum_{i=1}^n y_i^k with y_i^k = ∇f_i(x_{k-1}) if i = i_k, y_i^{k-1} otherwise. See also SVRG, SAGA, SDCA, MISO, Finito... Some of these algorithms perform updates of the form x_k <- x_{k-1} - η_k g_k with E[g_k] = ∇f(x_{k-1}), but g_k has lower variance than in SGD. [Schmidt et al., 2013, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015] Julien Mairal Optim for large-scale ML and sparse estimation 37/71
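A minimal sketch of the SAG update above: a table keeps the last gradient seen for each f_i, and each iteration refreshes one entry and moves along the scaled sum of the table; the constant γ is left as a user parameter since its theoretical value is not discussed here, and the interface is illustrative.

```python
import numpy as np

def sag(grad_fi, n, x0, L, gamma=1.0, n_iters=10000):
    """SAG sketch: x_k <- x_{k-1} - gamma/(L*n) * sum_i y_i, where y_i stores
    the last gradient of f_i that was computed."""
    rng = np.random.default_rng(0)
    x = x0.copy()
    y = np.zeros((n, x0.shape[0]))     # gradient table, one row per example
    y_sum = np.zeros_like(x0)          # running sum of the table
    for _ in range(n_iters):
        i = rng.integers(n)
        g = grad_fi(x, i)
        y_sum += g - y[i]              # update the sum in O(d)
        y[i] = g
        x = x - (gamma / (L * n)) * y_sum
    return x
```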

48 Incremental gradient descent methods These methods achieve low (worst-case) complexity in expectation. The number of gradient evaluations to ensure f(x_k) - f^* <= ε is, for µ > 0: FISTA: O(n sqrt(L/µ) log(1/ε)); SVRG, SAG, SAGA, SDCA, MISO, Finito: O(max(n, L/µ) log(1/ε)). Julien Mairal Optim for large-scale ML and sparse estimation 38/71

49 Incremental gradient descent methods These methods achieve low (worst-case) complexity in expectation. The number of gradient evaluations to ensure f(x_k) - f^* <= ε is, for µ > 0: FISTA: O(n sqrt(L/µ) log(1/ε)); SVRG, SAG, SAGA, SDCA, MISO, Finito: O(max(n, L/µ) log(1/ε)). Main features vs. stochastic gradient descent Same complexity per-iteration (but higher memory footprint). Faster convergence (exploit the finite-sum structure). Less parameter tuning than SGD. Some variants are compatible with a composite term ψ. Julien Mairal Optim for large-scale ML and sparse estimation 38/71

50 Incremental gradient descent methods These methods achieve low (worst-case) complexity in expectation. The number of gradient evaluations to ensure f(x_k) - f^* <= ε is, for µ > 0: FISTA: O(n sqrt(L/µ) log(1/ε)); SVRG, SAG, SAGA, SDCA, MISO, Finito: O(max(n, L/µ) log(1/ε)). Important remarks When f_i(x) = l(z_i^T x), the memory footprint is O(n), otherwise O(dn), except for SVRG (O(d)). Some algorithms require an estimate of µ; L is the average (or max) of the Lipschitz constants of the ∇f_i's. The L for FISTA is the Lipschitz constant of ∇f, which is at most L. Julien Mairal Optim for large-scale ML and sparse estimation 38/71

51 Incremental gradient descent methods stealing again a bit from F. Bach's slides. Variance reduction Consider two random variables X, Y and define Z = X - Y + E[Y]. Then, E[Z] = E[X] and Var(Z) = Var(X) + Var(Y) - 2 cov(X,Y). The variance of Z may be smaller if X and Y are positively correlated. Julien Mairal Optim for large-scale ML and sparse estimation 39/71

52 Incremental gradient descent methods stealing again a bit from F. Bach's slides. Variance reduction Consider two random variables X, Y and define Z = X - Y + E[Y]. Then, E[Z] = E[X] and Var(Z) = Var(X) + Var(Y) - 2 cov(X,Y). The variance of Z may be smaller if X and Y are positively correlated. Why is it useful for stochastic optimization? step-sizes for SGD have to decrease to ensure convergence. with variance reduction, one may use constant step-sizes. Julien Mairal Optim for large-scale ML and sparse estimation 39/71

53 Incremental gradient descent methods SVRG x_t = x_{t-1} - γ (∇f_{i_t}(x_{t-1}) - ∇f_{i_t}(y) + ∇f(y)), where y is updated every epoch and E[∇f_{i_t}(y) | F_{t-1}] = ∇f(y). SAGA x_t = x_{t-1} - γ (∇f_{i_t}(x_{t-1}) - y_{i_t}^{t-1} + (1/n) sum_{i=1}^n y_i^{t-1}), where E[y_{i_t}^{t-1} | F_{t-1}] = (1/n) sum_{i=1}^n y_i^{t-1} and y_i^t = ∇f_i(x_{t-1}) if i = i_t, y_i^{t-1} otherwise. MISO/Finito: for n >= L/µ, same form as SAGA but (1/n) sum_{i=1}^n y_i^{t-1} = µ x_{t-1} and y_i^t = ∇f_i(x_{t-1}) - µ x_{t-1} if i = i_t, y_i^{t-1} otherwise. Julien Mairal Optim for large-scale ML and sparse estimation 40/71
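A minimal sketch of the SVRG update above, with the anchor point y refreshed at the start of each epoch; the epoch length and interface are illustrative choices.

```python
import numpy as np

def svrg(grad_fi, grad_f, n, x0, gamma, n_epochs=30):
    """SVRG: x_t <- x_{t-1} - gamma * (grad_fi(x) - grad_fi(y) + grad_f(y))."""
    rng = np.random.default_rng(0)
    x = x0.copy()
    for _ in range(n_epochs):
        y = x.copy()                   # anchor point, updated every epoch
        full_grad = grad_f(y)          # one full pass over the data
        for _ in range(n):             # inner loop of one epoch
            i = rng.integers(n)
            x = x - gamma * (grad_fi(x, i) - grad_fi(y, i) + full_grad)
    return x
```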

54 Can we do even better for large finite sums? Without vs with acceleration, for µ > 0: FISTA: O(n sqrt(L/µ) log(1/ε)); SVRG, SAG, SAGA, SDCA, MISO, Finito: O(max(n, L/µ) log(1/ε)); Accelerated versions: Õ(max(n, sqrt(n L/µ)) log(1/ε)). Acceleration for specific algorithms [Shalev-Shwartz and Zhang, 2014, Lan, 2015, Allen-Zhu, 2016]. Generic acceleration: Catalyst [Lin et al., 2015]. see [Agarwal and Bottou, 2015] for discussions about optimality. Julien Mairal Optim for large-scale ML and sparse estimation 41/71

55 What we have not (or should have) covered Important approaches and concepts distributed optimization. proximal splitting / ADMM. Quasi-Newton approaches. Julien Mairal Optim for large-scale ML and sparse estimation 42/71

56 What we have not (or should have) covered Important approaches and concepts distributed optimization. proximal splitting / ADMM. Quasi-Newton approaches. The question Should we care that much about minimizing finite sums when all we want is minimizing an expectation? Julien Mairal Optim for large-scale ML and sparse estimation 42/71

57 Part II: Sparse estimation Julien Mairal Optim for large-scale ML and sparse estimation 43/71

58 Chronological overview of parsimony 14th century: Ockham's razor; 1921: Wrinch and Jeffreys' simplicity principle; 1952: Markowitz's portfolio selection; 60's and 70's: best subset selection in statistics; 70's: use of the l_1-norm for signal recovery in geophysics; 90's: wavelet thresholding in signal processing; 1996: Olshausen and Field's dictionary learning; 1996 and 1999: Lasso (statistics) and basis pursuit (signal processing); 2006: compressed sensing (signal processing) and Lasso consistency (statistics); 2006-now: applications of dictionary learning in various scientific fields such as image processing and computer vision. Julien Mairal Optim for large-scale ML and sparse estimation 44/71

59 Sparsity in the statistics literature from the 60's and 70's How to choose k? Mallows's C_p statistic [Mallows, 1964, 1966]; Akaike information criterion (AIC) [Akaike, 1973]; Bayesian information criterion (BIC) [Schwarz, 1978]; Minimum description length (MDL) [Rissanen, 1978]. These approaches lead to penalized problems min_{θ in R^p} L(θ) + λ ||θ||_0, with different choices of λ depending on the chosen criterion. Julien Mairal Optim for large-scale ML and sparse estimation 45/71

60 Sparsity in the statistics literature from the 60's and 70's How to solve the best k-subset selection problem? Unfortunately... the problem is NP-hard [Natarajan, 1995]. Two strategies combinatorial exploration with branch-and-bound techniques [Furnival and Wilson, 1974] leaps and bounds, exact algorithm but exponential complexity; greedy approach: forward selection [Efroymson, 1960] (originally developed for observing intermediate solutions), already contains all the ideas of matching pursuit algorithms. Important reference: [Hocking, 1976]. The analysis and selection of variables in linear regression. Biometrics. Julien Mairal Optim for large-scale ML and sparse estimation 46/71

61 Wavelet thresholding in signal processing from the 90's Wavelets were the topic of a long quest for representing natural images: 2D-Gabors [Daugman, 1985]; steerable wavelets [Simoncelli et al., 1992]; curvelets [Candès and Donoho, 2002]; contourlets [Do and Vertterli, 2003]; bandlets [Le Pennec and Mallat, 2005]; -lets (joke). (a) 2D Gabor filter. (b) With shifted phase. (c) With rotation. Julien Mairal Optim for large-scale ML and sparse estimation 47/71

62 Wavelet thresholding in signal processing from the 90's The theory of wavelets is well developed for continuous signals, e.g., in L^2(R), but also for discrete signals x in R^n. Julien Mairal Optim for large-scale ML and sparse estimation 48/71

63 Wavelet thresholding in signal processing from the 90's Given an orthogonal wavelet basis D = [d_1, ..., d_n] in R^{n x n}, the wavelet decomposition of x in R^n is simply β = D^T x and we have x = Dβ. The k-sparse approximation problem min_{α in R^p} (1/2) ||x - Dα||_2^2 s.t. ||α||_0 <= k, is not NP-hard here: since D is orthogonal, it is equivalent to min_{α in R^p} (1/2) ||β - α||_2^2 s.t. ||α||_0 <= k. Julien Mairal Optim for large-scale ML and sparse estimation 49/71

64 Wavelet thresholding in signal processing from the 90's Given an orthogonal wavelet basis D = [d_1, ..., d_n] in R^{n x n}, the wavelet decomposition of x in R^n is simply β = D^T x and we have x = Dβ. The k-sparse approximation problem min_{α in R^p} (1/2) ||x - Dα||_2^2 s.t. ||α||_0 <= k. The solution is obtained by hard-thresholding: α^{ht}[j] = δ_{|β[j]| >= µ} β[j], i.e., β[j] if |β[j]| >= µ and 0 otherwise, where µ is the k-th largest value among the set {|β[1]|, ..., |β[p]|}. Julien Mairal Optim for large-scale ML and sparse estimation 49/71
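A small sketch of this k-sparse approximation in an orthogonal basis: compute β = D^T x and keep its k largest entries in magnitude (the names and the trivial identity dictionary are illustrative).

```python
import numpy as np

def k_sparse_approx(x, D, k):
    """Best k-sparse approximation of x in an orthogonal basis D:
    hard-threshold the coefficients beta = D^T x at their k-th largest magnitude."""
    beta = D.T @ x
    alpha = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-k:]   # indices of the k largest |beta[j]|
    alpha[keep] = beta[keep]
    return alpha, D @ alpha                # sparse code and its reconstruction

# Trivial orthogonal dictionary (identity) for illustration.
alpha, x_approx = k_sparse_approx(np.array([0.1, -3.0, 0.5, 2.0]), np.eye(4), k=2)
```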

65 Wavelet thresholding in signal processing, 90's Another key operator is the soft-thresholding operator [see Donoho and Johnstone, 1994]: α^{st}[j] = sign(β[j]) max(|β[j]| - λ, 0), i.e., β[j] - λ if β[j] >= λ, β[j] + λ if β[j] <= -λ, and 0 otherwise, where λ is a parameter playing the same role as µ previously. With β = D^T x and D orthogonal, it provides the solution of the following sparse reconstruction problem: min_{α in R^p} (1/2) ||x - Dα||_2^2 + λ ||α||_1, which will be of high importance later. Julien Mairal Optim for large-scale ML and sparse estimation 50/71

66 Wavelet thresholding in signal processing, 90's (d) Soft-thresholding operator, α^{st} = sign(β) max(|β| - λ, 0). (e) Hard-thresholding operator, α^{ht} = δ_{|β| >= µ} β. Figure: Soft- and hard-thresholding operators, which are commonly used for signal estimation with orthogonal wavelet basis. Julien Mairal Optim for large-scale ML and sparse estimation 51/71

67 The modern parsimony and the l_1-norm Sparse linear models in signal processing Let x in R^n be a signal. Let D = [d_1, ..., d_p] in R^{n x p} be a set of elementary signals. We call it dictionary. D is adapted to x if it can represent it with a few elements, that is, there exists a sparse vector α in R^p such that x ≈ Dα. We call α the sparse code. (Figure: x ≈ Dα, with x in R^n, D in R^{n x p}, and α = (α[1], ..., α[p]) in R^p sparse.) Julien Mairal Optim for large-scale ML and sparse estimation 52/71

68 The modern parsimony and the l_1-norm Sparse linear models: machine learning/statistics point of view A few examples: Ridge regression: min_{β in R^p} (1/n) sum_{i=1}^n (1/2) (y_i - β^T x_i)^2 + λ ||β||_2^2. Linear SVM: min_{β in R^p} (1/n) sum_{i=1}^n max(0, 1 - y_i β^T x_i) + λ ||β||_2^2. Logistic regression: min_{β in R^p} (1/n) sum_{i=1}^n log(1 + e^{-y_i β^T x_i}) + λ ||β||_2^2. Julien Mairal Optim for large-scale ML and sparse estimation 53/71

69 The modern parsimony and the l_1-norm Sparse linear models: machine learning/statistics point of view A few examples: Ridge regression: min_{β in R^p} (1/n) sum_{i=1}^n (1/2) (y_i - β^T x_i)^2 + λ ||β||_2^2. Linear SVM: min_{β in R^p} (1/n) sum_{i=1}^n max(0, 1 - y_i β^T x_i) + λ ||β||_2^2. Logistic regression: min_{β in R^p} (1/n) sum_{i=1}^n log(1 + e^{-y_i β^T x_i}) + λ ||β||_2^2. The squared l_2-norm induces smoothness in β. When one knows in advance that β should be sparse, one should use a sparsity-inducing regularization such as the l_1-norm. [Chen et al., 1999, Tibshirani, 1996] Julien Mairal Optim for large-scale ML and sparse estimation 54/71
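For the ridge-regression example above, the squared l_2 penalty even yields a closed-form solution; a quick sketch of it, using the (1/n) and (1/2) scalings written on this slide (the function name is mine, and this is provided only as an illustration).

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimizer of (1/n) * sum_i 0.5*(y_i - beta^T x_i)^2 + lam * ||beta||_2^2,
    from the normal equations (X^T X + 2*n*lam*I) beta = X^T y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + 2.0 * n * lam * np.eye(p), X.T @ y)
```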

70 The modern parsimony and the l_1-norm Why does the l_1-norm induce sparsity? Can we get some intuition from the simplest isotropic case, α̂(λ) = argmin_{α in R^p} (1/2) ||x - α||_2^2 + λ ||α||_1, or equivalently the Euclidean projection onto the l_1-ball, α(µ) = argmin_{α in R^p} (1/2) ||x - α||_2^2 s.t. ||α||_1 <= µ? Equivalent means that for all λ > 0, there exists µ >= 0 such that α(µ) = α̂(λ). Julien Mairal Optim for large-scale ML and sparse estimation 55/71

71 The modern parsimony and the l_1-norm Why does the l_1-norm induce sparsity? Can we get some intuition from the simplest isotropic case, α̂(λ) = argmin_{α in R^p} (1/2) ||x - α||_2^2 + λ ||α||_1, or equivalently the Euclidean projection onto the l_1-ball, α(µ) = argmin_{α in R^p} (1/2) ||x - α||_2^2 s.t. ||α||_1 <= µ? Equivalent means that for all λ > 0, there exists µ >= 0 such that α(µ) = α̂(λ). The relation between µ and λ is unknown a priori. Julien Mairal Optim for large-scale ML and sparse estimation 55/71
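The constrained formulation requires the Euclidean projection onto the l_1-ball; here is a sketch of one standard sort-based O(p log p) algorithm for it (not described on the slides, and the function name is mine).

```python
import numpy as np

def project_l1_ball(x, mu):
    """Euclidean projection of x onto {alpha : ||alpha||_1 <= mu}."""
    if np.abs(x).sum() <= mu:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]                       # magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, x.size + 1) > cssv - mu)[0][-1]
    theta = (cssv[rho] - mu) / (rho + 1.0)             # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```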

72 Why does the l_1-norm induce sparsity? Regularizing with the l_1-norm (Figure: the l_1-ball ||α||_1 <= µ in the (α[1], α[2]) plane.) The projection onto a convex set is biased towards singularities. Julien Mairal Optim for large-scale ML and sparse estimation 56/71

73 Why does the l_1-norm induce sparsity? Regularizing with the l_2-norm (Figure: the l_2-ball ||α||_2 <= µ in the (α[1], α[2]) plane.) The l_2-norm is isotropic. Julien Mairal Optim for large-scale ML and sparse estimation 57/71

74 Why does the l 1 -norm induce sparsity? In 3D. (images produced by G. Obozinski) Julien Mairal Optim for large-scale ML and sparse estimation 58/71

75 Why does the l_1-norm induce sparsity? Regularizing with the l_∞-norm (Figure: the l_∞-ball ||α||_∞ <= µ in the (α[1], α[2]) plane.) The l_∞-norm encourages |α[1]| = |α[2]|. Julien Mairal Optim for large-scale ML and sparse estimation 59/71

76 Why does the l_1-norm induce sparsity? Analytical point of view: 1D case min_{α in R} (1/2)(x - α)^2 + λ|α|. Piecewise quadratic function with a kink at zero. Derivative at 0^+: g_+ = -x + λ and at 0^-: g_- = -x - λ. Optimality conditions. α is optimal iff: |α| > 0 and -(x - α) + λ sign(α) = 0, or α = 0 and g_+ >= 0 and g_- <= 0. The solution is a soft-thresholding: α^* = sign(x)(|x| - λ)_+. Julien Mairal Optim for large-scale ML and sparse estimation 60/71

77 Why does the l_1-norm induce sparsity? Comparison with l_2-regularization in 1D: ψ(α) = α^2 has ψ'(α) = 2α, whereas ψ(α) = |α| has one-sided derivatives ψ'_-(α) = -1 and ψ'_+(α) = 1 at zero. The gradient of the l_2-penalty vanishes when α gets close to 0. On its differentiable part, the norm of the gradient of the l_1-norm is constant. Julien Mairal Optim for large-scale ML and sparse estimation 61/71

78 Why does the l_1-norm induce sparsity? Physical illustration (Figure: two spring systems at rest, with E_1 = 0.) Julien Mairal Optim for large-scale ML and sparse estimation 62/71

79 Why does the l_1-norm induce sparsity? Physical illustration (Figure: left, two springs with energies E_1 = (k_1/2)(x - α)^2 and E_2 = (k_2/2) α^2; right, a spring and a weight with E_1 = (k_1/2)(x - α)^2 and E_2 = m g α.) Julien Mairal Optim for large-scale ML and sparse estimation 62/71

80 Why does the l_1-norm induce sparsity? Physical illustration (Figure: same setting; with the gravity term E_2 = m g α, the equilibrium is reached at α = 0!!) Julien Mairal Optim for large-scale ML and sparse estimation 62/71

81 Why does the l_1-norm induce sparsity? Figure: The regularization path of the Lasso, min_{α in R^p} (1/2) ||x - Dα||_2^2 + λ ||α||_1, showing the coefficient values w_1, ..., w_5 as functions of λ. Julien Mairal Optim for large-scale ML and sparse estimation 63/71
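A path such as the one in this figure can be traced numerically by solving the l_1-penalized problem over a decreasing grid of λ with warm starts; a sketch using the proximal gradient iteration from Part I (the figure itself may well have been produced with a dedicated homotopy/LARS solver, so this is only an illustration).

```python
import numpy as np

def lasso_path(D, x, lambdas, n_iters=500):
    """Approximate Lasso path: for each lam (largest first), run proximal
    gradient on 0.5*||x - D a||^2 + lam*||a||_1, warm-starting from the
    previous solution, and record the coefficients."""
    L = np.linalg.eigvalsh(D.T @ D).max()
    alpha = np.zeros(D.shape[1])
    path = []
    for lam in sorted(lambdas, reverse=True):
        for _ in range(n_iters):
            z = alpha - D.T @ (D @ alpha - x) / L
            alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        path.append(alpha.copy())
    return np.array(path)    # one row of coefficients per value of lam
```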

82 Non-convex sparsity-inducing penalties Exploiting concave functions with a kink at zero: ψ(α) = sum_{j=1}^p ϕ(|α[j]|). l_q-penalty, with 0 < q < 1: ψ(α) = sum_{j=1}^p |α[j]|^q, [Frank and Friedman, 1993]; log penalty, ψ(α) = sum_{j=1}^p log(|α[j]| + ε). ϕ is any function that looks like this: (Figure: a concave function ϕ(|α|) with a kink at zero.) Julien Mairal Optim for large-scale ML and sparse estimation 64/71

83 Non-convex sparsity-inducing penalties (a) l 0.5-ball, 2-D (b) l 1-ball, 2-D (c) l 2-ball, 2-D Figure: Open balls in 2-D corresponding to several l q -norms and pseudo-norms. Julien Mairal Optim for large-scale ML and sparse estimation 65/71

84 Non-convex sparsity-inducing penalties (Figure: the l_q-ball ||α||_q <= µ with q < 1, in the (α[1], α[2]) plane.) Julien Mairal Optim for large-scale ML and sparse estimation 66/71

85 Elastic-net The elastic net introduced by [Zou and Hastie, 2005]: ψ(α) = ||α||_1 + γ ||α||_2^2. The penalty provides more stable (but less sparse) solutions. (a) l_1-ball, 2-D (b) elastic-net, 2-D (c) l_2-ball, 2-D Julien Mairal Optim for large-scale ML and sparse estimation 67/71
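Assuming the penalty enters the objective as λ (||α||_1 + γ ||α||_2^2), its proximal operator has a simple closed form: soft-thresholding followed by a shrinkage. A small sketch under that assumption (function name is mine):

```python
import numpy as np

def prox_elastic_net(y, lam, gamma):
    """Prox of lam * (||.||_1 + gamma * ||.||_2^2): entrywise soft-threshold
    at lam, then shrink by 1 / (1 + 2 * lam * gamma)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0) / (1.0 + 2.0 * lam * gamma)
```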

86 The elastic-net vs other penalties (Figure: the l_q-ball ||α||_q <= µ with q < 1.) Julien Mairal Optim for large-scale ML and sparse estimation 68/71

87 The elastic-net vs other penalties (Figure: the l_1-ball ||α||_1 <= µ.) Julien Mairal Optim for large-scale ML and sparse estimation 68/71

88 The elastic-net vs other penalties (Figure: the elastic-net ball (1 - γ) ||α||_1 + γ ||α||_2^2 <= µ.) Julien Mairal Optim for large-scale ML and sparse estimation 68/71

89 The elastic-net vs other penalties (Figure: the l_2-ball ||α||_2^2 <= µ.) Julien Mairal Optim for large-scale ML and sparse estimation 68/71

90 Structured sparsity images produced by G. Obozinski Julien Mairal Optim for large-scale ML and sparse estimation 69/71

91 Structured sparsity images produced by G. Obozinski Julien Mairal Optim for large-scale ML and sparse estimation 70/71

92 Mark the date! July 2-6th, Grenoble Along with Naver Labs, Inria is organizing a summer school in Grenoble on artificial intelligence. Visit Among the distinguished speakers Lourdes Agapito (UCL) Kyunghyun Cho (NYU/Facebook) Emmanuel Dupoux (EHESS) Martial Hebert (CMU) Hugo Larochelle (Google Brain) Yann LeCun (Facebook/NYU) Jean Ponce (Inria) Cordelia Schmid (Inria) Andrew Zisserman (Oxford/Google DeepMind).... Julien Mairal Optim for large-scale ML and sparse estimation 71/71

93 References I A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proceedings of the International Conference on Machine Learning (ICML), H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages , Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. arxiv preprint arxiv: , A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1): , Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14 (1): , Julien Mairal Optim for large-scale ML and sparse estimation 72/71

94 References II Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arxiv preprint arxiv: , E. J. Candès and D. L. Donoho. Recovering edges in ill-posed inverse problems: Optimality of curvelet frames. Annals of Statistics, 30(3): , Antonin Chambolle and Jérôme Darbon. On total variation minimization and surface evolution using parametric maximum flows. International journal of computer vision, 84(3):288, S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33 61, P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM Multiscale Modeling and Simulation, 4(4): , David Corfield, Bernhard Schölkopf, and Vladimir Vapnik. Falsificationism and statistical learning theory: Comparing the popper and vapnik-chervonenkis dimensions. Journal for General Philosophy of Science, 40(1):51 58, I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11): , Julien Mairal Optim for large-scale ML and sparse estimation 73/71

95 References III J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A, 2(7): , A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a. A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conference on Machine Learning (ICML), 2014b. M. Do and M. Vertterli. Contourlets, Beyond Wavelets. Academic Press, D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3): , M. A. Efroymson. Multiple regression analysis. Mathematical methods for digital computers, 9(1): , I. E Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2): , Julien Mairal Optim for large-scale ML and sparse estimation 74/71

96 References IV G. M. Furnival and R. W. Wilson. Regressions by leaps and bounds. Technometrics, 16(4): , R. R. Hocking. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32:1 49, H. Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics, 19(4): , R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12: , Guanghui Lan. An optimal randomized incremental gradient method. arxiv preprint arxiv: , E. Le Pennec and S. Mallat. Sparse geometric image representations with bandelets. IEEE Transactions on Image Processing, 14(4): , H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, Julien Mairal Optim for large-scale ML and sparse estimation 75/71

97 References V J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2): , J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems (NIPS), C. L. Mallows. Choosing variables in a linear regression: A graphical aid. unpublished paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kansas, C. L. Mallows. Choosing a subset regression. unpublished paper presented at the Joint Statistical Meeting, Los Angeles, California, B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24: , Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1): , Julien Mairal Optim for large-scale ML and sparse estimation 76/71

98 References VI Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence o (1/k2). In Doklady an SSSR, volume 269, pages , R. D. Nowak and M. A. T. Figueiredo. Fast wavelet-based image deconvolution using the EM algorithm. In Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers., B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381: , J. Rissanen. Modeling by shortest data description. Automatica, 14(5): , M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arxiv: , Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arxiv preprint arxiv: , Julien Mairal Optim for large-scale ML and sparse estimation 77/71

99 References VII G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2): , S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arxiv: , S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1 41, E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2): , R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1): , Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7): , Julien Mairal Optim for large-scale ML and sparse estimation 78/71

100 References VIII D. Wrinch and H. Jeffreys. XLII. On certain fundamental principles of scientific inquiry. Philosophical Magazine Series 6, 42(249): , L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4): , Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the International Conference on Machine Learning (ICML), H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67(2): , Julien Mairal Optim for large-scale ML and sparse estimation 79/71


More information

Structured sparsity-inducing norms through submodular functions

Structured sparsity-inducing norms through submodular functions Structured sparsity-inducing norms through submodular functions Francis Bach Sierra team, INRIA - Ecole Normale Supérieure - CNRS Thanks to R. Jenatton, J. Mairal, G. Obozinski June 2011 Outline Introduction:

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Parcimonie en apprentissage statistique

Parcimonie en apprentissage statistique Parcimonie en apprentissage statistique Guillaume Obozinski Ecole des Ponts - ParisTech Journée Parcimonie Fédération Charles Hermite, 23 Juin 2014 Parcimonie en apprentissage 1/44 Classical supervised

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

More information

An Homotopy Algorithm for the Lasso with Online Observations

An Homotopy Algorithm for the Lasso with Online Observations An Homotopy Algorithm for the Lasso with Online Observations Pierre J. Garrigues Department of EECS Redwood Center for Theoretical Neuroscience University of California Berkeley, CA 94720 garrigue@eecs.berkeley.edu

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Convex relaxation for Combinatorial Penalties

Convex relaxation for Combinatorial Penalties Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,

More information

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 08: Sparsity Based Regularization. Lorenzo Rosasco

MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 08: Sparsity Based Regularization. Lorenzo Rosasco MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications Class 08: Sparsity Based Regularization Lorenzo Rosasco Learning algorithms so far ERM + explicit l 2 penalty 1 min w R d n n l(y

More information

A tutorial on sparse modeling. Outline:

A tutorial on sparse modeling. Outline: A tutorial on sparse modeling. Outline: 1. Why? 2. What? 3. How. 4. no really, why? Sparse modeling is a component in many state of the art signal processing and machine learning tasks. image processing

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Learning MMSE Optimal Thresholds for FISTA

Learning MMSE Optimal Thresholds for FISTA MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Learning MMSE Optimal Thresholds for FISTA Kamilov, U.; Mansour, H. TR2016-111 August 2016 Abstract Fast iterative shrinkage/thresholding algorithm

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

Stochastic Optimization: First order method

Stochastic Optimization: First order method Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO

More information

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

... SPARROW. SPARse approximation Weighted regression. Pardis Noorzad. Department of Computer Engineering and IT Amirkabir University of Technology

... SPARROW. SPARse approximation Weighted regression. Pardis Noorzad. Department of Computer Engineering and IT Amirkabir University of Technology ..... SPARROW SPARse approximation Weighted regression Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Université de Montréal March 12, 2012 SPARROW 1/47 .....

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

SPARSE SIGNAL RESTORATION. 1. Introduction

SPARSE SIGNAL RESTORATION. 1. Introduction SPARSE SIGNAL RESTORATION IVAN W. SELESNICK 1. Introduction These notes describe an approach for the restoration of degraded signals using sparsity. This approach, which has become quite popular, is useful

More information

Structured Sparse Estimation with Network Flow Optimization

Structured Sparse Estimation with Network Flow Optimization Structured Sparse Estimation with Network Flow Optimization Julien Mairal University of California, Berkeley Neyman seminar, Berkeley Julien Mairal Neyman seminar, UC Berkeley /48 Purpose of the talk introduce

More information