Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique

1 Journal Club: A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui). CMAP, Ecole Polytechnique, March 8th

2 Plan
1. Motivations
2. Existing Acceleration Methods
3. Universal Catalyst
4. Conclusion

3 Motivations

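4 Large class of problems
Unconstrained minimization of a large sum of functions:

$$\min_{x \in \mathbb{R}^p} \left\{ F(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \psi(x) \right\} \quad (1)$$

where the $f_i$ are convex and $L_i$-smooth, and $\psi$ is a convex penalty that is not necessarily differentiable. $L_i$-smoothness means that the gradient of $f_i$ is $L_i$-Lipschitz:

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\| \quad (2)$$

Goal: provide an acceleration scheme that can be applied to existing unaccelerated methods, with acceleration understood in the sense of Nesterov.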

5 Empirical Risk Minimization
Given training data $(y_i, z_i)_{i=1}^{n}$, where the $y_i$ are responses and the $z_i$ are regressors, $x$ represents the model parameters. The finite-sum term $\frac{1}{n}\sum_{i=1}^{n} f_i(x)$ is the loss and measures how well the model fits the training data, while $\psi$ prevents overfitting.

Example: logistic regression. Responses $y_i$ take values in $\{-1, +1\}$:

$$\min_{x \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \langle x, z_i \rangle}\right) + \lambda \|x\|^2 \right\} \quad (3)$$

Large-scale problems call for first-order, gradient-based methods.
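The regularized logistic objective above can be evaluated with a few lines of code. The following is a minimal sketch in plain Python, not code from the paper: the function names are hypothetical, and the labels are assumed to lie in {-1, +1} so that the margin form of the loss applies.

```python
import math

def _dot(u, v):
    return sum(uj * vj for uj, vj in zip(u, v))

def _sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def logistic_objective(x, data, lam):
    """F(x) = (1/n) sum_i log(1 + exp(-y_i <x, z_i>)) + lam * ||x||^2.

    `data` is a list of (y_i, z_i) pairs; labels are assumed in {-1, +1}.
    """
    total = 0.0
    for y, z in data:
        m = y * _dot(x, z)
        # Stable evaluation of log(1 + exp(-m)) for both signs of m.
        total += math.log1p(math.exp(-m)) if m > 0 else -m + math.log1p(math.exp(m))
    return total / len(data) + lam * _dot(x, x)

def logistic_gradient(x, data, lam):
    """Gradient of logistic_objective with respect to x."""
    n = len(data)
    grad = [2.0 * lam * xj for xj in x]
    for y, z in data:
        m = y * _dot(x, z)
        # d/dm log(1 + exp(-m)) = -sigmoid(-m)
        coeff = -y * _sigmoid(-m) / n
        for j, zj in enumerate(z):
            grad[j] += coeff * zj
    return grad
```

Each call to `logistic_gradient` touches all n samples, which is the per-iteration cost that incremental methods avoid.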

6 Main contributions
- Generic acceleration scheme, which applies to previously unaccelerated algorithms such as SVRG, SAG, SAGA, SDCA, MISO, or Finito, and which is not tailored to finite sums.
- Complexity analysis for $\mu$-strongly convex objectives.
- Complexity analysis for non-strongly convex objectives.

7 Existing Acceleration Methods

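8 First order method
The classical way to solve the problem without the penalty, $\min_{x \in \mathbb{R}^p} f(x)$ with an $L$-smooth objective function, is the gradient descent method:

$$x_k = x_{k-1} - \frac{1}{L} \nabla f(x_{k-1}) \quad (4)$$

It can be viewed as a proximal regularization of the linearization of $f$ at $x_{k-1}$ (Beck, Teboulle, 2009):

$$x_k = \arg\min_{x \in \mathbb{R}^p} \left\{ f(x_{k-1}) + \langle x - x_{k-1}, \nabla f(x_{k-1}) \rangle + \frac{L}{2} \|x - x_{k-1}\|^2 \right\} \quad (5)$$

Adding the penalty leads to ISTA:

$$x_k = \arg\min_{x \in \mathbb{R}^p} \left\{ \frac{L}{2} \left\| x - \left( x_{k-1} - \frac{1}{L} \nabla f(x_{k-1}) \right) \right\|^2 + \psi(x) \right\} \quad (6)$$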
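To make the ISTA update concrete, here is a minimal sketch in plain Python. It assumes the specific penalty ψ(x) = λ‖x‖₁ (an assumption of this example, not stated on the slide), for which the minimization in the ISTA step has the closed-form soft-thresholding solution. The function names are hypothetical.

```python
import math

def soft_threshold(v, t):
    # Closed-form minimizer of 0.5 * ||x - v||^2 + t * ||x||_1, coordinate-wise.
    return [math.copysign(max(abs(vj) - t, 0.0), vj) for vj in v]

def ista(grad_f, L, lam, x0, n_iter):
    """ISTA for min f(x) + lam * ||x||_1, with f convex and L-smooth.

    grad_f: callable returning the gradient of f as a list.
    """
    x = list(x0)
    for _ in range(n_iter):
        g = grad_f(x)
        # Gradient step with step size 1/L, then the proximal step.
        x = soft_threshold([xj - gj / L for xj, gj in zip(x, g)], lam / L)
    return x

# Toy problem: f(x) = 0.5 * sum_j a_j * (x_j - b_j)^2, so L = max(a).
a, b = [1.0, 4.0], [3.0, -0.2]
grad_f = lambda x: [aj * (xj - bj) for aj, xj, bj in zip(a, x, b)]
x_ista = ista(grad_f, L=4.0, lam=1.0, x0=[0.0, 0.0], n_iter=200)
```

For this separable toy problem the exact minimizer is the soft-thresholding of `b_j` at level `lam / a_j`, which gives a direct way to check the iterates.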

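9 Nesterov Acceleration
In the 1980s, Nesterov introduced an acceleration scheme that adds a memory term to the descent:

$$x_k = \arg\min_{x \in \mathbb{R}^p} \left\{ \frac{L}{2} \left\| x - \left( y_{k-1} - \frac{1}{L} \nabla f(y_{k-1}) \right) \right\|^2 + \psi(x) \right\} \quad (7)$$

with $y_{k-1} = x_{k-1} + \beta_k (x_{k-1} - x_{k-2})$ and $0 < \beta_k < 1$.

Complexity to reach an $\epsilon$-solution, i.e. $f(x_k) - f(x^*) \le \epsilon$:

Algo  | $\mu > 0$ | $\mu = 0$
ISTA  | $O\left(n \frac{L}{\mu} \log(1/\epsilon)\right)$ | $O\left(\frac{nL}{\epsilon}\right)$
FISTA | $O\left(n \sqrt{\frac{L}{\mu}} \log(1/\epsilon)\right)$ | $O\left(\frac{nL}{\sqrt{\epsilon}}\right)$

The large-sum structure of $f$ is not exploited here.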
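The memory term can be illustrated with a short FISTA sketch in plain Python. Two assumptions are made here that are not on the slide: the penalty is ψ(x) = λ‖x‖₁, and the momentum β_k follows the Beck and Teboulle sequence t_{k+1} = (1 + sqrt(1 + 4 t_k²)) / 2 (the slide only requires 0 < β_k < 1; with this sequence the very first β is 0). Function names are hypothetical.

```python
import math

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1, applied coordinate-wise.
    return [math.copysign(max(abs(vj) - t, 0.0), vj) for vj in v]

def fista(grad_f, L, lam, x0, n_iter):
    """FISTA for min f(x) + lam * ||x||_1, with f convex and L-smooth."""
    x_prev = list(x0)
    y = list(x0)
    t = 1.0
    for _ in range(n_iter):
        g = grad_f(y)
        # Proximal gradient step taken at the extrapolated point y.
        x = soft_threshold([yj - gj / L for yj, gj in zip(y, g)], lam / L)
        # Momentum weight from the Beck-Teboulle t-sequence.
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        beta = (t - 1.0) / t_next
        # Memory term: extrapolate along the last displacement.
        y = [xj + beta * (xj - xpj) for xj, xpj in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev

# Same toy problem as a diagonal quadratic: f(x) = 0.5 * sum_j a_j * (x_j - b_j)^2.
a, b = [1.0, 4.0], [3.0, -0.2]
grad_f = lambda x: [aj * (xj - bj) for aj, xj, bj in zip(a, x, b)]
x_fista = fista(grad_f, L=4.0, lam=1.0, x0=[0.0, 0.0], n_iter=1000)
```

The only difference from plain ISTA is that the gradient is evaluated at `y` rather than at the last iterate.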

10 SAG/SAGA/MISO
Randomized algorithms take into account the structure of the objective function and compute only one random gradient at each iteration, which yields a better expected computational complexity.

To get $\mathbb{E}[f(x_k) - f(x^*)] \le \epsilon$ we need $O(1/\epsilon)$ iterations.

Algo                 | $\mu > 0$
SAG, SAGA, MISO etc. | $O\left(\max\left(n, \frac{L}{\mu}\right) \log(1/\epsilon)\right)$
FISTA                | $O\left(n \sqrt{\frac{L}{\mu}} \log(1/\epsilon)\right)$

Acceleration when the number of observations is large enough:

$$\max\left(n, \frac{L}{\mu}\right) \le n \sqrt{\frac{L}{\mu}} \iff n \ge \sqrt{\frac{L}{\mu}} \quad (8)$$

This is not acceleration in the sense of Nesterov, though: it is due to the incremental update, not to a memory term. See Bottou et al.
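The comparison in (8) is easy to check numerically. The sketch below, in plain Python with a hypothetical function name, compares the two complexity factors with constants and the common log(1/ε) factor dropped, so it only indicates the regime and is not a runtime prediction.

```python
import math

def incremental_beats_fista(n, L, mu):
    """Compare max(n, L/mu) with n * sqrt(L/mu), ignoring constants and log factors."""
    incremental = max(n, L / mu)
    accelerated = n * math.sqrt(L / mu)
    return incremental <= accelerated, incremental, accelerated

# Condition number L/mu = 1e4, so the threshold is n >= sqrt(1e4) = 100.
many_samples = incremental_beats_fista(n=1000, L=1e4, mu=1.0)
few_samples = incremental_beats_fista(n=10, L=1e4, mu=1.0)
```

With 1000 samples the incremental factor (1e4) is below the FISTA factor (1e5); with 10 samples the order is reversed (1e4 against 1e3).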

11 Universal Catalyst

12 Universal Catalyst
Challenge: can we accelerate these algorithms by a universal scheme for both convex and strongly convex objectives?

Given any algorithm M that can solve a convex problem, at iteration $k$, rather than minimizing $F(x)$, use as many iterations of M as needed to approximately minimize

$$G_k(x) \triangleq F(x) + \frac{K}{2} \|x - y_{k-1}\|^2 \quad (9)$$

and obtain $x_k$ such that $G_k(x_k) - G_k^* \le \epsilon_k$, where $G_k^*$ is the minimum of $G_k$.

Then compute $y_k = x_k + \beta_k (x_k - x_{k-1})$ with

$$\beta_k = \frac{\alpha_{k-1}(1 - \alpha_{k-1})}{\alpha_{k-1}^2 + \alpha_k}, \qquad \alpha_k^2 = (1 - \alpha_k)\alpha_{k-1}^2 + q \alpha_k, \qquad q = \frac{\mu}{\mu + K}.$$

The Catalyst algorithm A is a wrapper of M that takes advantage of both a basic majorization-minimization (MM) scheme and Nesterov acceleration.
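The outer loop described on this slide can be sketched in plain Python. This is a minimal illustration under several assumptions that are not from the slides: F is smooth (ψ = 0), M is plain gradient descent on G_k, F is nonnegative so that F(x_0) can stand in for F(x_0) - F*, and the inner stopping test uses the strong-convexity bound G_k(x) - G_k* ≤ ‖∇G_k(x)‖² / (2(µ + K)). The initialization α_0 = sqrt(q) and the schedule ε_k = (2/9) F(x_0) (1 - ρ)^k with ρ = 0.9 sqrt(q) follow the strongly convex setting of the paper. The name `catalyst` and its signature are hypothetical.

```python
import math

def catalyst(F, grad_F, L, mu, kappa, x0, n_outer, max_inner=100000):
    """Catalyst wrapper around gradient descent for smooth, mu-strongly convex F >= 0."""
    q = mu / (mu + kappa)
    alpha = math.sqrt(q)                 # alpha_0 = sqrt(q)
    rho = 0.9 * math.sqrt(q)
    eps0 = (2.0 / 9.0) * F(x0)           # uses F(x0) in place of F(x0) - F*
    x_prev, y = list(x0), list(x0)
    for k in range(1, n_outer + 1):
        eps_k = eps0 * (1.0 - rho) ** k
        # Inner loop: gradient descent on G_k(x) = F(x) + kappa/2 * ||x - y||^2,
        # warm-started at the previous iterate.
        x = list(x_prev)
        for _ in range(max_inner):
            g = [gj + kappa * (xj - yj) for gj, xj, yj in zip(grad_F(x), x, y)]
            # Strong convexity certifies G_k(x) - G_k* <= ||g||^2 / (2 (mu + kappa)).
            if sum(gj * gj for gj in g) <= 2.0 * (mu + kappa) * eps_k:
                break
            x = [xj - gj / (L + kappa) for xj, gj in zip(x, g)]
        # alpha_k is the positive root of a^2 = (1 - a) * alpha_{k-1}^2 + q * a.
        a2 = alpha * alpha
        alpha_next = 0.5 * (q - a2 + math.sqrt((a2 - q) ** 2 + 4.0 * a2))
        beta = alpha * (1.0 - alpha) / (a2 + alpha_next)
        # Extrapolation step y_k = x_k + beta_k * (x_k - x_{k-1}).
        y = [xj + beta * (xj - xpj) for xj, xpj in zip(x, x_prev)]
        x_prev, alpha = x, alpha_next
    return x_prev

# Toy strongly convex quadratic with F* = 0, L = 100, mu = 1.
a, b = [1.0, 100.0], [1.0, 1.0]
F = lambda x: 0.5 * sum(aj * (xj - bj) ** 2 for aj, xj, bj in zip(a, x, b))
grad_F = lambda x: [aj * (xj - bj) for aj, xj, bj in zip(a, x, b)]
x_cat = catalyst(F, grad_F, L=100.0, mu=1.0, kappa=10.0, x0=[0.0, 0.0], n_outer=100)
```

The value `kappa=10.0` is an arbitrary choice for the illustration; swapping the inner `for` loop for another linearly convergent method leaves the outer loop unchanged, which is the sense in which the scheme is a wrapper.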

13 Two stages algorithm
$G_k$ is easier to minimize than $F$:
- $G_k$ is always strongly convex as long as $F$ is convex.
- $G_k$ has a better condition number when $F$ is strongly convex: $\frac{L+K}{\mu+K} < \frac{L}{\mu}$.
- A trade-off must be found between $K \gg 1$ (easy subproblems) and $K = 0$ ($G_k = F$).

Inner loop: how many iterations of M are needed to reach the precision $\epsilon_k$, i.e. $G_k(x_k) - G_k^* \le \epsilon_k$?
Outer loop: given the sequence $(x_k)$ produced by M, choose the extrapolated point $y_k$ (step size $\beta_k$) wisely to obtain the optimal rate on $F(x_k) - F^*$.
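As a quick numerical illustration of the condition-number claim, take the hypothetical values $L = 100$, $\mu = 1$ and $K = 10$, chosen only to make the arithmetic easy:

```latex
\frac{L + K}{\mu + K} = \frac{100 + 10}{1 + 10} = 10
\quad < \quad
\frac{L}{\mu} = \frac{100}{1} = 100,
\qquad
q = \frac{\mu}{\mu + K} = \frac{1}{11}.
```

Increasing $K$ pushes the ratio $(L+K)/(\mu+K)$ toward 1, so each subproblem gets easier, but $q$ shrinks at the same time, which is the trade-off mentioned above.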

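14 Main Theorem
For a strongly convex objective, choose $\alpha_0 = \sqrt{q}$ with $q = \frac{\mu}{\mu + K}$, and the sequence

$$\epsilon_k = \frac{2}{9} \left( F(x_0) - F^* \right) (1 - \rho)^k \quad (10)$$

with $\rho < \sqrt{q}$. Then the algorithm generates iterates $(x_k)$ such that

$$F(x_k) - F^* \le C (1 - \rho)^{k+1} \left( F(x_0) - F^* \right) \quad \text{with} \quad C = \frac{8}{(\sqrt{q} - \rho)^2} \quad (11)$$

In practice $\rho = 0.9 \sqrt{q}$, and since $F^*$ is unknown, for a nonnegative function $F$ we can set $\epsilon_k = \frac{2}{9} F(x_0) (1 - \rho)^k$.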
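The quantities in the strongly convex theorem can be tabulated directly. The sketch below is plain Python with a hypothetical function name; `gap0` stands for F(x_0) - F* (or F(x_0) when F is nonnegative and F* is unknown), and `rho_factor=0.9` encodes the practical choice ρ = 0.9 sqrt(q).

```python
import math

def strongly_convex_schedule(mu, kappa, gap0, n_outer, rho_factor=0.9):
    """Return q, rho, C, the tolerances eps_k and the bounds on F(x_k) - F*.

    Both lists are indexed by k = 0, ..., n_outer.
    """
    q = mu / (mu + kappa)
    rho = rho_factor * math.sqrt(q)       # requires rho < sqrt(q)
    C = 8.0 / (math.sqrt(q) - rho) ** 2
    eps = [(2.0 / 9.0) * gap0 * (1.0 - rho) ** k for k in range(n_outer + 1)]
    bounds = [C * (1.0 - rho) ** (k + 1) * gap0 for k in range(n_outer + 1)]
    return q, rho, C, eps, bounds

# Illustrative values: mu = 1, kappa = 3 gives q = 0.25 and sqrt(q) = 0.5.
q, rho, C, eps, bounds = strongly_convex_schedule(mu=1.0, kappa=3.0, gap0=9.0, n_outer=2)
```

Note that C grows quickly as ρ approaches sqrt(q): with ρ = 0.9 sqrt(q) the denominator is (0.1 sqrt(q))², so the bound only becomes informative after enough outer iterations.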

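15 Main Theorem
For a non-strongly convex objective, choose $\alpha_0 = (\sqrt{5} - 1)/2$ and, for some $\eta > 0$, the sequence

$$\epsilon_k = \frac{2 \left( F(x_0) - F^* \right)}{9 (k + 2)^{4 + \eta}} \quad (12)$$

Then the algorithm generates iterates $(x_k)$ such that

$$F(x_k) - F^* \le \frac{8}{(k + 2)^2} \left( \left( 1 + \frac{2}{\eta} \right)^2 \left( F(x_0) - F^* \right) + \frac{K}{2} \|x_0 - x^*\|^2 \right) \quad (13)$$

In practice, $\eta$ is fixed to a constant.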

16 Inner loop algorithm
An appropriate M (applied to $G_k$) for a strongly convex objective function has to have a linear convergence rate, i.e. there exists $\tau_M$ such that

$$G_k(z_t) - G_k^* \le (1 - \tau_M)^t \left( G_k(z_0) - G_k^* \right) \quad (14)$$

$\tau_M$ depends on the condition number. ISTA: $\tau_{M,F} = \mu / L$, and FISTA: $\tau_{M,F} = \sqrt{\mu / L}$.

Thanks to the quadratic term added to $F$ we can achieve faster rates, since for ISTA $\tau_{M,G_k} = \frac{\mu + K}{L + K} > \tau_{M,F}$.

With the proposed sequence $(\epsilon_k)$, choosing $z_0 = x_{k-1}$, the precision is reached after:
- Strongly convex case: a constant number of iterations, $\tilde{O}(1 / \tau_M)$.
- Convex case: $\tilde{O}\left( \frac{1}{\tau_M} \log(k + 2) \right)$ iterations.
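The linear rate (14) translates into an explicit iteration count: (1 - τ)^t (G_k(z_0) - G_k*) ≤ ε holds as soon as t ≥ log((G_k(z_0) - G_k*) / ε) / (-log(1 - τ)). The sketch below is plain Python with a hypothetical function name, and `gap0` stands for the initial gap G_k(z_0) - G_k*, which is not known exactly in practice.

```python
import math

def inner_iterations(tau, gap0, eps):
    """Smallest t with (1 - tau)^t * gap0 <= eps, for a linear rate 0 < tau < 1."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if gap0 <= eps:
        return 0
    return math.ceil(math.log(gap0 / eps) / -math.log1p(-tau))

# ISTA rates with the illustrative values L = 100, mu = 1, K = 10:
# tau on F is mu/L = 0.01, tau on G_k is (mu + K)/(L + K) = 0.1.
t_on_F = inner_iterations(tau=0.01, gap0=1.0, eps=0.01)
t_on_G = inner_iterations(tau=0.1, gap0=1.0, eps=0.01)
```

Reducing the gap by a factor of 100 takes 44 iterations at rate 0.1, against several hundred at rate 0.01, which is the gain brought by the added quadratic term.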

17 Conclusion

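18 Expected Computational complexity
Case $n \le L/\mu$ when $\mu > 0$:

Algo | $\mu > 0$ | $\mu = 0$ | Cat. $\mu > 0$ | Cat. $\mu = 0$
FG   | $O\left(n \frac{L}{\mu} \log\frac{1}{\epsilon}\right)$ | $O\left(n \frac{L}{\epsilon}\right)$ | $\tilde{O}\left(n \sqrt{\frac{L}{\mu}} \log\frac{1}{\epsilon}\right)$ | $\tilde{O}\left(n \frac{L}{\sqrt{\epsilon}}\right)$
SAGA | $O\left(\frac{L}{\mu} \log\frac{1}{\epsilon}\right)$ | $O\left(n \frac{L}{\epsilon}\right)$ | $\tilde{O}\left(\sqrt{\frac{nL}{\mu}} \log\frac{1}{\epsilon}\right)$ | $\tilde{O}\left(n \frac{L}{\sqrt{\epsilon}}\right)$
MISO | $O\left(\frac{L}{\mu} \log\frac{1}{\epsilon}\right)$ | NA | $\tilde{O}\left(\sqrt{\frac{nL}{\mu}} \log\frac{1}{\epsilon}\right)$ | $\tilde{O}\left(n \frac{L}{\sqrt{\epsilon}}\right)$

Plus:
- Simple acceleration scheme that applies to a large class of methods.
- Recovers optimal rates for known algorithms.
- Simple to implement.

Minus:
- Acceleration only when $n \le L/\mu$; otherwise it is hard to beat $O(n \log(1/\epsilon))$.
- $\mu$ is just an estimate of the true strong convexity parameter $\mu' \ge \mu$.
- When $n \le L/\mu$ but $n \ge L/\mu'$, the problem appears to be hard to accelerate.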

19 Thank you
