A Universal Catalyst for Gradient-Based Optimization


1 A Universal Catalyst for Gradient-Based Optimization. Julien Mairal, Inria, Grenoble. CIMI workshop, Toulouse, 2015.

2 Collaborators: Hongzhou Lin, Zaid Harchaoui. Publication: H. Lin, J. Mairal and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. Adv. NIPS 2015.

3 Focus of this work
Minimizing large finite sums: given some data, learn some model parameters x in R^p by minimizing
$$\min_{x \in \mathbb{R}^p} \Big\{ F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \Big\},$$
where each f_i is smooth and convex and ψ is a convex but not necessarily differentiable regularization penalty.
Goal of this work: design accelerated methods for minimizing large finite sums; accelerate previously un-accelerated algorithms.

4 Why do large finite sums matter?
Empirical risk minimization:
$$\min_{x \in \mathbb{R}^p} \Big\{ F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \Big\}.$$
Typically, x represents model parameters, each function f_i measures the fidelity of x to a data point, and ψ is a regularization function to prevent overfitting. For instance, given training data (y_i, z_i)_{i=1,...,n} with features z_i in R^p and labels y_i in {−1,+1}, f_i may measure how far the prediction sign(⟨z_i, x⟩) is from the true label y_i. This would be a classification problem with a linear model.

5 Why do large finite sums matter? A few examples
Ridge regression: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(y_i - \langle x, z_i\rangle)^2 + \frac{\lambda}{2}\|x\|_2^2$.
Linear SVM: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i\langle x, z_i\rangle) + \frac{\lambda}{2}\|x\|_2^2$.
Logistic regression: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \log\big(1 + e^{-y_i\langle x, z_i\rangle}\big) + \frac{\lambda}{2}\|x\|_2^2$.
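
These three objectives are straightforward to evaluate with a few lines of NumPy. The sketch below is illustrative only: the function names and the matrix convention (Z of shape n × p with rows z_i) are assumptions, not part of the slides.

```python
import numpy as np

def ridge_objective(x, Z, y, lam):
    # (1/n) sum_i (1/2)(y_i - <x, z_i>)^2 + (lam/2) ||x||_2^2
    residuals = y - Z @ x
    return 0.5 * np.mean(residuals ** 2) + 0.5 * lam * (x @ x)

def svm_objective(x, Z, y, lam):
    # (1/n) sum_i max(0, 1 - y_i <x, z_i>) + (lam/2) ||x||_2^2
    margins = 1.0 - y * (Z @ x)
    return np.mean(np.maximum(0.0, margins)) + 0.5 * lam * (x @ x)

def logistic_objective(x, Z, y, lam):
    # (1/n) sum_i log(1 + exp(-y_i <x, z_i>)) + (lam/2) ||x||_2^2
    scores = y * (Z @ x)  # logaddexp(0, -s) = log(1 + e^{-s}), numerically stable
    return np.mean(np.logaddexp(0.0, -scores)) + 0.5 * lam * (x @ x)
```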

6 Why does the composite problem matter? A few examples
Ridge regression: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(y_i - \langle x, z_i\rangle)^2 + \frac{\lambda}{2}\|x\|_2^2$.
Linear SVM: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i\langle x, z_i\rangle) + \frac{\lambda}{2}\|x\|_2^2$.
Logistic regression: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \log\big(1 + e^{-y_i\langle x, z_i\rangle}\big) + \frac{\lambda}{2}\|x\|_2^2$.
Regularization was called by Vladimir Vapnik one of the first signs of the existence of intelligent inference. The squared ℓ2-norm penalizes large entries in x.

7 Why does the composite problem matter? A few examples
Ridge regression: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(y_i - \langle x, z_i\rangle)^2 + \lambda\|x\|_1$.
Linear SVM: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i\langle x, z_i\rangle)^2 + \lambda\|x\|_1$.
Logistic regression: $\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \log\big(1 + e^{-y_i\langle x, z_i\rangle}\big) + \lambda\|x\|_1$.
When one knows in advance that x should be sparse, one should use a sparsity-inducing regularization such as the ℓ1-norm [Chen et al., 1999, Tibshirani, 1996].

8 What do we mean by acceleration?
Let us consider the composite problem
$$\min_{x \in \mathbb{R}^p} f(x) + \psi(x),$$
where f is convex and differentiable with an L-Lipschitz continuous gradient, and ψ is convex but not necessarily differentiable.
The classical forward-backward/ISTA algorithm:
$$x_k \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \; \frac{1}{2}\Big\|x - \Big(x_{k-1} - \frac{1}{L}\nabla f(x_{k-1})\Big)\Big\|_2^2 + \frac{1}{L}\psi(x).$$
F(x_k) − F* = O(1/k) for convex problems; F(x_k) − F* = O((1 − µ/L)^k) for µ-strongly convex problems.
[Nowak and Figueiredo, 2001, Daubechies et al., 2004, Combettes and Wajs, 2005, Beck and Teboulle, 2009, Wright et al., 2009, Nesterov, 2013]
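
The forward-backward step above is easy to implement once the proximal operator of ψ is available. Here is a minimal sketch, assuming grad_f and prox_psi are user-supplied callables (hypothetical names); for the ℓ1-norm, prox_psi is the classical soft-thresholding operator.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(grad_f, prox_psi, L, x0, iterations=100):
    # x_k = prox_{psi/L}( x_{k-1} - (1/L) grad f(x_{k-1}) )
    x = x0.copy()
    for _ in range(iterations):
        x = prox_psi(x - grad_f(x) / L, 1.0 / L)
    return x
```

For ψ = λ‖·‖_1, one would pass prox_psi = lambda v, t: soft_threshold(v, lam * t).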

9 What do we mean by acceleration?
Nesterov introduced in the 80's an acceleration scheme for the gradient descent algorithm. It was later generalized to the composite setting.
FISTA [Beck and Teboulle, 2009]:
$$x_k \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \; \frac{1}{2}\Big\|x - \Big(y_{k-1} - \frac{1}{L}\nabla f(y_{k-1})\Big)\Big\|_2^2 + \frac{1}{L}\psi(x);$$
find α_k > 0 such that α_k² = (1 − α_k)α_{k−1}² + (µ/L)α_k;
set y_k = x_k + β_k(x_k − x_{k−1}) with β_k = α_{k−1}(1 − α_{k−1})/(α_{k−1}² + α_k).
F(x_k) − F* = O(1/k²) for convex problems; F(x_k) − F* = O((1 − √(µ/L))^k) for µ-strongly convex problems.
See also [Nesterov, 1983, 2004, 2013].
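
A minimal sketch of this scheme, reusing the same oracles as the ISTA sketch above; the quadratic equation for α_k is solved in closed form, and the choices of α_0 follow the remarks on the next slides (assumed conventions, not tuned values).

```python
import numpy as np

def fista(grad_f, prox_psi, L, x0, mu=0.0, iterations=100):
    # Accelerated forward-backward with the alpha/beta recursions of the slide.
    x_prev = x0.copy()
    y = x0.copy()
    q = mu / L
    alpha = np.sqrt(q) if mu > 0 else (np.sqrt(5.0) - 1.0) / 2.0  # alpha_0
    for _ in range(iterations):
        x = prox_psi(y - grad_f(y) / L, 1.0 / L)
        # solve alpha_k^2 = (1 - alpha_k) alpha_{k-1}^2 + q alpha_k for alpha_k
        a2 = alpha ** 2
        alpha_new = 0.5 * (q - a2 + np.sqrt((q - a2) ** 2 + 4.0 * a2))
        beta = alpha * (1.0 - alpha) / (a2 + alpha_new)
        y = x + beta * (x - x_prev)  # extrapolation step
        x_prev, alpha = x, alpha_new
    return x_prev
```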

10 What do we mean by acceleration?
Remarks: acceleration works in many practical cases; FISTA is parameter-free thanks to simple line-search schemes; the original FISTA paper was for µ = 0 and α_0 = (√5 − 1)/2.
Complexity analysis for large finite sums: to minimize a sum of n functions with the guarantee F(x_k) − F* ≤ ε, the number of gradient evaluations is upper-bounded by

         µ > 0                     µ = 0
ISTA:    O( n (L/µ) log(1/ε) )     O( nL/ε )
FISTA:   O( n √(L/µ) log(1/ε) )    O( n √(L/ε) )

11 Can we do better for large finite sums?
For n = 1, no! The rates are optimal for a first-order local black box [Nesterov, 2004].
For n ≫ 1, yes! We need to design algorithms whose per-iteration computational complexity is smaller than n, and whose convergence rate may be worse than FISTA... but with a better expected computational complexity.

                                        µ > 0
FISTA:                                  O( n √(L/µ) log(1/ε) )
SVRG, SAG, SAGA, SDCA, MISO, Finito:    O( (n + L/µ) log(1/ε) )

[Schmidt et al., 2013, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]
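
To make the comparison concrete, here is a minimal sketch of one such incremental method, SVRG [Xiao and Zhang, 2014], in the smooth unregularized case; the step size and the epoch length m below are assumptions left to the user, not values from the slides.

```python
import numpy as np

def svrg(grads, x0, step, epochs=30, inner=None):
    # grads[i](x) returns the gradient of f_i at x; there are n such oracles.
    n = len(grads)
    m = inner if inner is not None else 2 * n  # epoch length
    x = x0.copy()
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        x_ref = x.copy()
        full_grad = sum(g(x_ref) for g in grads) / n  # one full pass per epoch
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced stochastic gradient: O(1) gradients per step
            v = grads[i](x) - grads[i](x_ref) + full_grad
            x = x - step * v
    return x
```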

12 Can we do even better for large finite sums?
SVRG, SAG, SAGA, SDCA, MISO, and Finito improve upon FISTA when n ≥ √(L/µ), but they are not accelerated in the sense of Nesterov.
Without vs. with acceleration:

                                        µ > 0
SVRG, SAG, SAGA, SDCA, MISO, Finito:    O( (n + L/µ) log(1/ε) )
Acc-SDCA:                               Õ( (n + √(nL/µ)) log(1/ε) )

Remarks: acceleration occurs when L/µ ≥ n; see [Agarwal and Bottou, 2015] for discussions about optimality; Acc-SDCA is due to Shalev-Shwartz and Zhang [2014].

13 Outline of this presentation
Part I: Catalyst
Part II: Proximal MISO

15 Main idea
Catalyst, a meta-algorithm: consider an algorithm M that can minimize F up to any accuracy.
At iteration k, replace F by a stabilized counterpart G_k centered at the prox-center y_{k−1},
$$G_k(x) = F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|_2^2,$$
and minimize G_k up to accuracy ε_k, i.e., such that G_k(x_k) − G_k* ≤ ε_k.
Compute the next prox-center y_k using an extrapolation step y_k = x_k + β_k(x_k − x_{k−1}).
The choices of β_k, ε_k, κ are driven by the theoretical analysis. Catalyst is a wrapper of M that yields an accelerated algorithm A.

16 This work
Contributions: a generic acceleration scheme, which applies to previously unaccelerated algorithms such as SVRG, SAG, SAGA, SDCA, MISO, or Finito, and which is not tailored to finite sums; explicit support for non-strongly convex objectives; complexity analysis for µ-strongly convex objectives; complexity analysis for non-strongly convex objectives.
Example of application: Garber and Hazan [2015] have used Catalyst to accelerate new principal component analysis algorithms based on convex optimization.

17 Sources of inspiration
In addition to accelerated proximal algorithms [Beck and Teboulle, 2009, Nesterov, 2013], two works have inspired Catalyst.
The inexact accelerated proximal point algorithm of Güler [1992]: Catalyst is a variant of inexact accelerated PPA; complexity analysis for the outer loop only.
Accelerated SDCA of Shalev-Shwartz and Zhang [2014]: accelerated SDCA is an instance of inexact accelerated PPA; complexity analysis limited to µ-strongly convex objectives.
The complexity analysis is not just a theoretical exercise, since it provides the values of κ, ε_k, β_k, which are required in concrete implementations. Here, theoretical values match practical ones.

18 Other related works
Analysis of inexact accelerated algorithms: accelerated PPA with various inexactness criteria [Salzo and Villa, 2012, He and Yuan, 2012], with complexity analysis limited to the outer loop; inexact accelerated forward-backward [Schmidt et al., 2011].
Un-regularizing [Frostig et al., 2015]: acceleration based on inexact accelerated PPA; complexity analysis limited to µ-strongly convex objectives.
Randomized primal-dual gradient [Lan, 2015]: an algorithm for minimizing large finite sums with built-in acceleration (no outer loop); primal-dual and game-theoretic interpretation.

19 Theoretical setting
Linear convergence of M: assume that F is µ-strongly convex. An algorithm M has a linear convergence rate if there exist τ_{M,F} in (0,1) and a constant C_{M,F} in R such that
$$F(x_k) - F^* \le C_{M,F}(1 - \tau_{M,F})^k. \quad (1)$$
For a given algorithm, τ_{M,F} usually depends on the condition number L/µ, e.g., τ_{M,F} = µ/L for ISTA.
Catalyst applies to algorithms that satisfy (1), which is classical for gradient-based approaches, yielding a method A whose rate τ_{A,F} improves upon τ_{M,F}.

20 Catalyst action
Consider an algorithm M that can minimize F up to any accuracy. At iteration k, replace F by a stabilized counterpart G_k centered at the prox-center y_{k−1},
$$G_k(x) = F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|_2^2,$$
and minimize G_k up to accuracy ε_k.
If κ ≫ 1, then minimizing G_k is easy: the new condition number is (L+κ)/(µ+κ) ≪ L/µ.
If κ ≈ 0, then G_k is a good approximation of F.

23 How does acceleration work?
Unfortunately, the literature does not provide any simple geometric explanation... but there are a few obvious facts, and a mechanism introduced by Nesterov, called the estimate sequence.
Obvious fact: simple gradient descent steps are blind to the past iterates and are based on a purely local model of the objective, whereas accelerated methods usually involve an extrapolation step y_k = x_k + β_k(x_k − x_{k−1}) with β_k in (0,1).
Unfortunately, extrapolation only gives a vague intuition and does not explain why it leads to concrete, convergent accelerated algorithms.

24 How does acceleration work?
If f is µ-strongly convex and ∇f is L-Lipschitz continuous, then for all x,
$$f(x) \le f(x_{k-1}) + \nabla f(x_{k-1})^\top (x - x_{k-1}) + \frac{L}{2}\|x - x_{k-1}\|_2^2;$$
$$f(x) \ge f(x_{k-1}) + \nabla f(x_{k-1})^\top (x - x_{k-1}) + \frac{\mu}{2}\|x - x_{k-1}\|_2^2.$$
Figure: f with its quadratic upper and lower bounds at x_{k−1}.

25 How does acceleration work?
If ∇f is L-Lipschitz continuous,
$$f(x) \le f(x_{k-1}) + \nabla f(x_{k-1})^\top (x - x_{k-1}) + \frac{L}{2}\|x - x_{k-1}\|_2^2,$$
and x_k = x_{k−1} − (1/L)∇f(x_{k−1}) (gradient descent step) minimizes this upper bound.
Figure: the gradient descent step as the minimizer of the quadratic upper bound at x_{k−1}.

26 How does acceleration work?
Gradient descent with step size 1/L can be interpreted as iteratively minimizing local upper bounds of the objective. This is also the case for ISTA for minimizing F(x) = f(x) + ψ(x):
$$x_k \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \; \underbrace{f(x_{k-1}) + \nabla f(x_{k-1})^\top (x - x_{k-1}) + \frac{L}{2}\|x - x_{k-1}\|_2^2 + \psi(x)}_{\text{upper bound of } f(x)+\psi(x)}.$$
Instead, accelerated gradient methods construct models of the objective with less conservative and less local principles (use of lower and upper bounds, use of previous models to construct the current one). These models, called estimate sequences, were introduced by Nesterov. They provide a methodology for designing algorithms and proving their convergence.

27 How does acceleration work?
Definition of estimate sequence [Nesterov]. A pair of sequences (ϕ_k)_{k≥0} and (λ_k)_{k≥0}, with λ_k ≥ 0 and ϕ_k : R^p → R, is called an estimate sequence of a function F if λ_k → 0 and, for any x in R^p and all k ≥ 0,
$$\varphi_k(x) - F(x) \le \lambda_k\big(\varphi_0(x) - F(x)\big).$$
In addition, if for some sequence (x_k)_{k≥0} we have
$$F(x_k) \le \varphi_k^* := \min_{x \in \mathbb{R}^p} \varphi_k(x),$$
then
$$F(x_k) - F^* \le \lambda_k\big(\varphi_0(x^*) - F^*\big),$$
where x* is a minimizer of F.
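
Writing out the chain of inequalities behind the last implication (assuming the two defining properties above) shows why estimate sequences certify a rate:
$$F(x_k) \le \varphi_k^* \le \varphi_k(x^*) \le F(x^*) + \lambda_k\big(\varphi_0(x^*) - F(x^*)\big) = F^* + \lambda_k\big(\varphi_0(x^*) - F^*\big),$$
so λ_k → 0 drives F(x_k) to F* at the speed of λ_k.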

28 How does acceleration work?
In summary, we need two properties:
1. ϕ_k(x) ≤ (1 − λ_k)F(x) + λ_k ϕ_0(x);
2. F(x_k) ≤ ϕ_k* = min_{x∈R^p} ϕ_k(x).
Remarks: ϕ_k is neither an upper bound nor a lower bound; finding the right estimate sequence is often nontrivial.

29 How does acceleration work?
In summary, we need two properties:
1. ϕ_k(x) ≤ (1 − λ_k)F(x) + λ_k ϕ_0(x);
2. F(x_k) ≤ ϕ_k* = min_{x∈R^p} ϕ_k(x).
How to build an estimate sequence? Define ϕ_k recursively as
$$\varphi_k(x) = (1 - \alpha_k)\varphi_{k-1}(x) + \alpha_k d_k(x),$$
where d_k is a lower bound; e.g., if F is smooth,
$$d_k(x) = F(y_k) + \nabla F(y_k)^\top (x - y_k) + \frac{\mu}{2}\|x - y_k\|_2^2.$$
Then, work hard to choose α_k as large as possible, and y_k and x_k such that property 2 holds. Subsequently, λ_k = ∏_{t=1}^k (1 − α_t).
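
As a sanity check (an added verification, not on the original slide), property 1 indeed propagates by induction with λ_k = (1 − α_k)λ_{k−1}: since d_k(x) ≤ F(x),
$$\varphi_k(x) = (1-\alpha_k)\varphi_{k-1}(x) + \alpha_k d_k(x) \le (1-\alpha_k)\big[(1-\lambda_{k-1})F(x) + \lambda_{k-1}\varphi_0(x)\big] + \alpha_k F(x) = (1-\lambda_k)F(x) + \lambda_k\varphi_0(x).$$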

30 Back to Catalyst
Catalyst consists of solving, up to accuracy ε_k, the sub-problems
$$G_k(x) = F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|_2^2.$$
Main challenges: find the right estimate sequence, that is, the right lower bound; introduce the right definition of an approximate estimate sequence; control the accumulation of the errors (ε_k)_{k≥0}; control the complexity of solving the sub-problems with M.
Differences with [Güler, 1992]: a different estimate sequence and a more rigorous convergence analysis of the outer-loop algorithm; a weaker inexactness criterion, which allows us to control the complexity of the inner-loop algorithm.

31 Catalyst, the algorithm
Algorithm 1 Catalyst
input: initial estimate x_0 in R^p, parameters κ and α_0, sequence (ε_k)_{k≥0}, optimization method M;
initialize: q = µ/(µ+κ) and y_0 = x_0;
1: while the desired stopping criterion is not satisfied do
2:   find an approximate solution x_k using M such that G_k(x_k) − G_k* ≤ ε_k, where
     $$G_k(x) = F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|_2^2;$$
3:   compute α_k in (0,1) from the equation α_k² = (1 − α_k)α_{k−1}² + qα_k;
4:   compute y_k = x_k + β_k(x_k − x_{k−1}) with β_k = α_{k−1}(1 − α_{k−1})/(α_{k−1}² + α_k);
5: end while
output: x_k (final estimate).
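
A minimal Python sketch of Algorithm 1, under assumed interfaces: method_M stands for any linearly convergent solver taking an objective, a warm start, and a target accuracy; in a real implementation, M would also be given the gradient oracle ∇G_k(x) = ∇F(x) + κ(x − y_{k−1}) rather than function values alone.

```python
import numpy as np

def catalyst(F, method_M, x0, kappa, mu, eps_schedule, outer_iters=50):
    # method_M(G, x_init, eps): approximately minimizes G up to accuracy eps,
    # warm-started at x_init. eps_schedule(k) returns the target accuracy eps_k.
    q = mu / (mu + kappa)
    alpha = np.sqrt(q) if mu > 0 else (np.sqrt(5.0) - 1.0) / 2.0  # alpha_0
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(outer_iters):
        # G_k(x) = F(x) + (kappa/2) ||x - y_{k-1}||^2 (default arg freezes y)
        G = lambda x, y=y: F(x) + 0.5 * kappa * np.dot(x - y, x - y)
        x = method_M(G, x_prev, eps_schedule(k))
        # alpha_k^2 = (1 - alpha_k) alpha_{k-1}^2 + q alpha_k
        a2 = alpha ** 2
        alpha_new = 0.5 * (q - a2 + np.sqrt((q - a2) ** 2 + 4.0 * a2))
        beta = alpha * (1.0 - alpha) / (a2 + alpha_new)
        y = x + beta * (x - x_prev)  # extrapolation step
        x_prev, alpha = x, alpha_new
    return x_prev
```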

32 Analysis for µ-strongly convex objective functions
Convergence of the outer-loop algorithm: choose α_0 = √q with q = µ/(µ+κ) and
$$\varepsilon_k = \frac{2}{9}(F(x_0) - F^*)(1 - \rho)^k \quad \text{with} \quad \rho < \sqrt{q}.$$
Then the algorithm generates iterates (x_k)_{k≥0} such that
$$F(x_k) - F^* \le C(1 - \rho)^{k+1}(F(x_0) - F^*) \quad \text{with} \quad C = \frac{8}{(\sqrt{q} - \rho)^2}.$$
Remarks: ρ can safely be set to 0.9√q in practice; the choice of (ε_k)_{k≥0} typically follows from a duality gap at x_0; when F is non-negative, set ε_k = (2/9)F(x_0)(1 − ρ)^k.
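
These remarks translate directly into a schedule one can pass to the outer loop sketched earlier; a small helper under the slide's recommendations (ρ = 0.9√q, with gap0 an assumed upper bound on F(x_0) − F*, e.g., a duality gap at x_0, or F(x_0) itself when F ≥ 0):

```python
import numpy as np

def make_eps_schedule(gap0, mu, kappa):
    # eps_k = (2/9) * gap0 * (1 - rho)^k with rho = 0.9 sqrt(q), q = mu/(mu+kappa)
    q = mu / (mu + kappa)
    rho = 0.9 * np.sqrt(q)
    return lambda k: (2.0 / 9.0) * gap0 * (1.0 - rho) ** k
```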

33 Analysis for µ-strongly convex objective functions
Control of the inner-loop complexity: under the same assumptions, consider a method M generating iterates (z_t)_{t≥0} for minimizing G_k with linear convergence rate
$$G_k(z_t) - G_k^* \le A(1 - \tau_M)^t \big(G_k(z_0) - G_k^*\big).$$
When z_0 = x_{k−1}, the precision ε_k is reached within T_M = Õ(1/τ_M) iterations, where the notation Õ hides some universal constants and some logarithmic dependencies in µ and κ.

34 Analysis for µ-strongly convex objective functions
Global computational complexity: calling F_s the objective function value obtained after performing s = kT_M iterations of the method M, the true convergence rate of the accelerated algorithm A is
$$F_s - F^* = F(x_{s/T_M}) - F^* \le C(1 - \rho)^{s/T_M}\big(F(x_0) - F^*\big). \quad (2)$$
Algorithm A has a global linear rate of convergence with parameter
$$\tau_{A,F} = \rho/T_M = \tilde{O}\big(\tau_M \sqrt{\mu/(\mu+\kappa)}\big),$$
where τ_M typically depends on κ (the greater κ, the faster M). Hence κ is chosen to maximize the ratio τ_M/√(µ+κ); e.g., κ = L − 2µ when τ_M = (µ+κ)/(L+κ), which gives τ_A = Õ(√(µ/L)).

36 Analysis for non-strongly convex objective functions
Convergence of the outer-loop algorithm, µ = 0: choose α_0 = (√5 − 1)/2 and
$$\varepsilon_k = \frac{2(F(x_0) - F^*)}{9(k+2)^{4+\eta}} \quad \text{with} \quad \eta > 0.$$
Then the meta-algorithm generates iterates (x_k)_{k≥0} such that
$$F(x_k) - F^* \le \frac{8}{(k+2)^2}\Big(\Big(1 + \frac{2}{\eta}\Big)^2 \big(F(x_0) - F^*\big) + \frac{\kappa}{2}\|x_0 - x^*\|^2\Big). \quad (3)$$
Remarks: the meta-algorithm achieves the optimal rate of convergence of first-order methods; this result disproves a conjecture of Salzo and Villa [2012].

37 Analysis for non-strongly convex objective functions
Control of the inner-loop complexity: assume that F has bounded level sets. Under the same assumptions, consider a method M generating iterates (z_t)_{t≥0} for minimizing G_k with linear convergence rate
$$G_k(z_t) - G_k^* \le A(1 - \tau_M)^t \big(G_k(z_0) - G_k^*\big).$$
Then there exists T_M = Õ(1/τ_M) such that, for any k ≥ 1, solving G_k with initial point x_{k−1} requires at most T_M log(k+2) iterations of M.

38 Analysis for non-strongly convex objective functions
Global computational complexity: to produce x_k, M is called at most kT_M log(k+2) times. Using the global iteration counter s = kT_M log(k+2), we get
$$F_s - F^* \le \frac{8\, T_M^2 \log^2(s)}{s^2}\Big(\Big(1 + \frac{2}{\eta}\Big)^2 \big(F(x_0) - F^*\big) + \frac{\kappa}{2}\|x_0 - x^*\|^2\Big).$$
If M is a first-order method, this rate is near-optimal, up to a logarithmic factor, when compared to the optimal rate O(1/s²); this may be the price to pay for using a generic acceleration scheme. Here κ is chosen to maximize the ratio τ_M/√(L+κ).

39 Catalyst in practice
General strategy and application to randomized algorithms: calculating the iteration complexity decomposes into three steps.
1. When F is µ-strongly convex, find κ that maximizes the ratio τ_{M,G_k}/√(µ+κ) for algorithm M. When F is non-strongly convex, maximize instead the ratio τ_{M,G_k}/√(L+κ).
2. Compute the upper bound k_out on the number of outer iterations using the theorems.
3. Compute the upper bound k_in on the expected number of inner iterations, max_{k=1,...,k_out} E[T_{M,G_k}(ε_k)] ≤ k_in.
Then the expected iteration complexity, denoted Comp, is given by Comp ≤ k_in · k_out.

40 Applications
Deterministic and randomized incremental gradient methods: Stochastic Average Gradient (SAG and SAGA) [Schmidt et al., 2013, Defazio et al., 2014a]; Finito and MISO [Mairal, 2015, Defazio et al., 2014b]; Semi-Stochastic/Mixed Gradient [Konečný et al., 2014, Zhang et al., 2013]; Stochastic Dual Coordinate Ascent [Shalev-Shwartz and Zhang, 2012]; Stochastic Variance Reduced Gradient [Xiao and Zhang, 2014].
But also randomized coordinate descent methods and, more generally, first-order methods with linear convergence rates.

41 Applications
Expected computational complexity in the regime n ≤ L/µ when µ > 0:

Method                  µ > 0                        µ = 0         Catalyst, µ > 0                Catalyst, µ = 0
FG                      O( n(L/µ) log(1/ε) )         O( nL/ε )     Õ( n√(L/µ) log(1/ε) )          Õ( n√(L/ε) )
SAG, SAGA, Finito/MISO,
SDCA, SVRG              O( (n + L/µ) log(1/ε) )      NA            Õ( (n + √(nL/µ)) log(1/ε) )    Õ( √(nL/ε) )
Acc-FG                  O( n√(L/µ) log(1/ε) )        O( n√(L/ε) )  no acceleration                no acceleration
Acc-SDCA                Õ( (n + √(nL/µ)) log(1/ε) )  NA            no acceleration                no acceleration

42 Experiments with MISO/SAG/SAGA
Restarting: the theory tells us to restart M with x_{k−1}; for SDCA/Finito/MISO, the theory tells us to use instead x_{k−1} + κ/(µ+κ)(y_{k−1} − y_{k−2}). We also tried this as a heuristic for SAG and SAGA.
One-pass heuristic: constrain M to always perform at most n iterations; we call this variant AMISO2 for MISO, whereas AMISO1 refers to the regular vanilla accelerated variant; idem to accelerate SAG and SAGA.

43 Experiments with MISO/SAG/SAGA
ℓ2-logistic regression formulation: given some data (y_i, z_i), with y_i in {−1,+1} and z_i in R^p, minimize
$$\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \log\big(1 + e^{-y_i x^\top z_i}\big) + \frac{\mu}{2}\|x\|_2^2,$$
where µ is the regularization parameter and the strong-convexity modulus.
Datasets: rcv1, real-sim, covtype, ocr, alpha.

44 Experiments without strong convexity, µ = 0
Figure: objective function value versus number of passes on the data, for the datasets covtype, real-sim, rcv1, alpha, and ocr (µ = 0).
Conclusions: SAG and SAGA are accelerated when they do not perform well already; AMISO2 improves upon AMISO1 (vanilla); MISO does not apply.

45 Experiments with strong convexity, µ/L = 10^{-1}/n
Figure: relative duality gap (log scale) versus number of passes on the data, for the datasets covtype, real-sim, rcv1, alpha, and ocr; curves for MISO, AMISO1, AMISO2, SAG, ASAG, SAGA, ASAGA.
Conclusions: SAG and SAGA are not always accelerated, but often are; AMISO2 and AMISO1 improve upon MISO.

46 Experiments with strong convexity, µ/L = 10^{-3}/n
Figure: relative duality gap (log scale) versus number of passes on the data, for the datasets covtype, real-sim, rcv1, alpha, and ocr.
Conclusions: same conclusions as for µ/L = 10^{-1}/n; µ is so small that (unaccelerated) MISO becomes numerically unstable.

47 General conclusions about Catalyst
Summary, lots of nice features: a simple acceleration scheme with a broad application range; recovers near-optimal rates for known algorithms; effortless implementation.
... but also lots of unsolved problems: acceleration occurs when n ≤ L/µ; otherwise, the unaccelerated complexity O(n log(1/ε)) seems unbeatable. The µ we use is an estimate of the true strong convexity parameter µ* ≥ µ, and it is the global strong convexity parameter, not a local one µ_loc ≥ µ. When n ≤ L/µ but n ≥ L/µ* (or n ≥ L/µ_loc), a method M that adapts to the unknown strong convexity may be impossible to accelerate. The optimal restart for M is not yet fully understood.

48 Ideas of the proofs
Main theorem: let us denote
$$\lambda_k = \prod_{i=0}^{k-1}(1 - \alpha_i), \quad (4)$$
where the α_i's are defined in Catalyst. Then the sequence (x_k)_{k≥0} satisfies
$$F(x_k) - F^* \le \lambda_k \Big(\sqrt{S_k} + 2\sum_{i=1}^k \sqrt{\frac{\varepsilon_i}{\lambda_i}}\Big)^2, \quad (5)$$
where F* is the minimum value of F and
$$S_k = F(x_0) - F^* + \frac{\gamma_0}{2}\|x_0 - x^*\|^2 + \sum_{i=1}^k \frac{\varepsilon_i}{\lambda_i}, \quad \text{where} \quad \gamma_0 = \frac{\alpha_0\big((\kappa+\mu)\alpha_0 - \mu\big)}{1 - \alpha_0}, \quad (6)$$
and x* is a minimizer of F.

49 Ideas of the proofs
The theorem is then used with the following lemma to control the convergence rate of the sequence (λ_k)_{k≥0}, whose definition follows the classical use of estimate sequences. This provides convergence rates both for the strongly convex and non-strongly convex cases.
Lemma from Nesterov [2004]: if the quantity γ_0 defined in (6) satisfies γ_0 ≥ µ, then the sequence (λ_k)_{k≥0} from (4) satisfies
$$\lambda_k \le \min\Bigg\{(1 - \sqrt{q})^k, \; \frac{4}{\big(2 + k\sqrt{\gamma_0/(\kappa+\mu)}\big)^2}\Bigg\}, \quad (7)$$
where q = µ/(µ+κ).

50 Ideas of the proofs
Step 1: build an approximate estimate sequence. Remember that, in general, we build ϕ_k from ϕ_{k−1} as
$$\varphi_k(x) = (1 - \alpha_k)\varphi_{k-1}(x) + \alpha_k d_k(x),$$
where d_k is a lower bound. Here, a natural lower bound would be F(x) ≥ d_k(x) with
$$d_k(x) = F(x_k^*) + \langle \kappa(y_{k-1} - x_k^*),\, x - x_k^* \rangle + \frac{\mu}{2}\|x - x_k^*\|^2,
\quad \text{where} \quad x_k^* = \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \Big\{ G_k(x) = F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 \Big\}.$$
But x_k* is unknown! Then, use instead d_k(x) defined from the approximate solution x_k:
$$d_k(x) = F(x_k) + \langle \kappa(y_{k-1} - x_k),\, x - x_k \rangle + \frac{\mu}{2}\|x - x_k\|^2.$$

51 Ideas of the proofs
Step 2: relax the condition F(x_k) ≤ ϕ_k*. We can show that Catalyst generates iterates (x_k)_{k≥0} such that
$$F(x_k) \le \varphi_k^* + \xi_k,$$
where the sequence (ξ_k)_{k≥0} is defined by ξ_0 = 0 and
$$\xi_k = (1 - \alpha_{k-1})\big(\xi_{k-1} + \varepsilon_k - (\kappa+\mu)\langle x_k - x_k^*,\, x_{k-1} - x_k \rangle\big).$$
The sequences (α_k)_{k≥0} and (y_k)_{k≥0} are chosen in such a way that all the terms involving y_{k−1} − x_k are cancelled. We then control the quantity ‖x_k − x_k*‖ by the strong convexity of G_k:
$$\frac{\kappa+\mu}{2}\|x_k - x_k^*\|^2 \le G_k(x_k) - G_k^* \le \varepsilon_k.$$

52 Ideas of the proofs
Step 3: control how these errors sum up together. Do cumbersome calculus.

53 Outline
Part I: Catalyst
Part II: Proximal MISO

54 Original motivation
Given some data, learn some model parameters x in R^p by minimizing
$$\min_{x \in \mathbb{R}^p} \Big\{ F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x) \Big\},$$
where each f_i may be nonsmooth and nonconvex. The original MISO algorithm is an incremental extension of the majorization-minimization principle [Lange et al., 2000].
Publications: J. Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 2015. J. Mairal. Optimization with First-Order Surrogate Functions. ICML, 2013.

55 Majorization-minimization principle
Figure: a surrogate g(θ) majorizing the objective f(θ) around the current iterate.
Iteratively minimize locally tight upper bounds of the objective; the objective monotonically decreases. Under some assumptions, we get convergence rates similar to gradient-based approaches for convex problems.

56 Incremental optimization: MISO
Algorithm 2 Incremental scheme MISO
input: x_0 in R^p; K (number of iterations).
1: choose surrogates g_i^0 of f_i near x_0 for all i;
2: for k = 1,...,K do
3:   randomly pick one index î_k and choose a surrogate g_{î_k}^k of f_{î_k} near x_{k−1}; set g_i^k = g_i^{k−1} for i ≠ î_k;
4:   update the solution:
     $$x_k \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n g_i^k(x);$$
5: end for
output: x_K (final estimate).

57 Incremental optimization: MISO
Update rule with basic upper bounds: we want to minimize (1/n)∑_{i=1}^n f_i(x), where the f_i's are smooth. With quadratic surrogates anchored at points y_i^k,
$$x_k \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \Big[ f_i(y_i^k) + \nabla f_i(y_i^k)^\top (x - y_i^k) + \frac{L}{2}\|x - y_i^k\|^2 \Big]
= \frac{1}{n}\sum_{i=1}^n y_i^k - \frac{1}{Ln}\sum_{i=1}^n \nabla f_i(y_i^k).$$
At iteration k, randomly draw one index î_k and update y_{î_k}^k ← x_{k−1}.
Remarks: replacing (1/n)∑_{i=1}^n y_i^k by x_{k−1} yields SAG [Schmidt et al., 2013]; replacing 1/L by 1/µ for strongly convex problems is close to a variant of SDCA [Shalev-Shwartz and Zhang, 2012].
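
A minimal sketch of this update, keeping the per-function anchors y_i and their stored gradients explicit. The array layout and helper names are assumptions; a practical version would maintain running sums so that each iteration costs O(p) instead of the O(np) recomputation below.

```python
import numpy as np

def miso_basic(grads, L, x0, iterations=1000, seed=0):
    # grads[i](x): gradient of f_i at x. Maintains one anchor y_i per function
    # and the closed-form minimizer x_k = mean(y_i) - (1/(L n)) sum_i grad f_i(y_i).
    n = len(grads)
    y = np.tile(x0, (n, 1))                    # y_i^0 = x_0 for all i
    g = np.array([gi(x0) for gi in grads])     # stored gradients at the anchors
    x = y.mean(axis=0) - g.mean(axis=0) / L
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        i = rng.integers(n)
        y[i] = x                               # refresh the surrogate of f_i
        g[i] = grads[i](x)
        x = y.mean(axis=0) - g.mean(axis=0) / L
    return x
```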

58 Incremental optimization: MISOµ
Update rule with lower bounds??? We want to minimize (1/n)∑_{i=1}^n f_i(x), where the f_i's are smooth. Replace the upper bounds by µ-strongly convex lower bounds:
$$x_k = \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \Big[ f_i(y_i^k) + \nabla f_i(y_i^k)^\top (x - y_i^k) + \frac{\mu}{2}\|x - y_i^k\|^2 \Big]
= \frac{1}{n}\sum_{i=1}^n y_i^k - \frac{1}{\mu n}\sum_{i=1}^n \nabla f_i(y_i^k).$$
Remarks: requires strong convexity; uses a counter-intuitive minorization-minimization principle; close to a variant of SDCA [Shalev-Shwartz and Zhang, 2012]; much faster than the basic MISO (faster rate).

59 Incremental optimization: MISOµ
In the first part of this presentation, what we have called MISO is the algorithm that uses 1/(µn) step sizes (sorry for the confusion). To minimize F(x) = (1/n)∑_{i=1}^n f_i(x), MISOµ has the following guarantee.
Proposition [Mairal, 2015]: when the functions f_i are µ-strongly convex, differentiable with L-Lipschitz gradient, and non-negative, MISOµ satisfies
$$\mathbb{E}[F(x_k) - F^*] \le \Big(1 - \frac{1}{3n}\Big)^k nF^*,$$
under the condition n ≥ 2L/µ.
Remarks: when n ≤ 2L/µ, the algorithm may diverge; when µ is very small, numerical stability is an issue; the condition f_i ≥ 0 does not really matter.

60 Proximal MISO [Lin, Mairal, and Harchaoui, 2015]
Main goals: remove the condition n ≥ 2L/µ; allow a composite term ψ:
$$\min_{x \in \mathbb{R}^p} \Big\{ F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \Big\}.$$
Starting point: MISOµ iteratively updates and minimizes a lower bound of F,
$$x_k \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \Big\{ D_k(x) = \frac{1}{n}\sum_{i=1}^n d_i^k(x) \Big\}.$$

61 Proximal MISO
Adding the proximal term:
$$x_k \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \Big\{ D_k(x) = \frac{1}{n}\sum_{i=1}^n d_i^k(x) + \psi(x) \Big\}.$$
Removing the condition n ≥ 2L/µ: for i = î_k,
$$d_i^k(x) = (1-\delta)\, d_i^{k-1}(x) + \delta\Big(f_i(x_{k-1}) + \langle \nabla f_i(x_{k-1}),\, x - x_{k-1}\rangle + \frac{\mu}{2}\|x - x_{k-1}\|^2\Big).$$
Remarks: the original MISOµ uses δ = 1; to get rid of the condition n ≥ 2L/µ, proximal MISO uses instead δ = min(1, µn/(2(L − µ))); variant 5 of SDCA [Shalev-Shwartz and Zhang, 2012] is identical with another value δ = µn/(L + µn) in (0,1).

62 Proximal MISO
Convergence of MISO-Prox: let (x_k)_{k≥0} be obtained by MISO-Prox; then
$$\mathbb{E}[F(x_k)] - F^* \le \frac{1}{\tau}(1-\tau)^{k+1}\big(F(x_0) - D_0(x_0)\big) \quad \text{with} \quad \tau \ge \min\Big\{\frac{\mu}{4L}, \frac{1}{2n}\Big\}. \quad (8)$$
Furthermore, we also have fast convergence of the certificate:
$$\mathbb{E}[F(x_k) - D_k(x_k)] \le \frac{1}{\tau}(1-\tau)^k \big(F^* - D_0(x_0)\big).$$
Differences with SDCA: the construction is primal; the proof of convergence and the algorithm do not use duality, while SDCA is a dual ascent technique; D_k(x_k) is a lower bound of F*, so it plays the same role as the dual in SDCA, but is easier to evaluate.

63 Conclusions
A relatively simple algorithm, with a simple convergence proof and a simple optimality certificate. Catalyst not only accelerates it, but also stabilizes it numerically with the parameter δ = 1. Close to SDCA, but without duality.

64 References I
A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 2009.
S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33-61, 1999.
P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM Multiscale Modeling and Simulation, 4(4), 2005.
I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11), 2004.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.

65 References II
A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conference on Machine Learning (ICML), 2014b.
R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Un-regularizing: approximate proximal point algorithms for empirical risk minimization. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
D. Garber and E. Hazan. Fast and simple PCA via convex optimization. arXiv preprint, 2015.
O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4), 1992.
B. He and X. Yuan. An accelerated inexact proximal point algorithm for convex minimization. Journal of Optimization Theory and Applications, 154(2), 2012.
J. Konečný, J. Liu, P. Richtárik, and M. Takáč. mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv preprint, 2014.

66 References III
G. Lan. An optimal randomized incremental gradient method. arXiv preprint, 2015.
K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1-20, 2000.
H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2), 2015.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1), 2013.
Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, volume 269, 1983.

67 References IV
R. D. Nowak and M. A. T. Figueiredo. Fast wavelet-based image deconvolution using the EM algorithm. In Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, 2001.
S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19(4), 2012.
M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2011.
M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013.
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint, 2012.
S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1-41, 2014.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58(1), 1996.

68 References V
S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7), 2009.
L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2014.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems (NIPS), 2013.
Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the International Conference on Machine Learning (ICML), 2015.


More information

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Active-set complexity of proximal gradient

Active-set complexity of proximal gradient Noname manuscript No. will be inserted by the editor) Active-set complexity of proximal gradient How long does it take to find the sparsity pattern? Julie Nutini Mark Schmidt Warren Hare Received: date

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization

Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization Mach earn 207 06:595 622 DOI 0.007/s0994-06-5609- Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization Yiu-ming Cheung Jian ou Received:

More information

An interior-point stochastic approximation method and an L1-regularized delta rule

An interior-point stochastic approximation method and an L1-regularized delta rule Photograph from National Geographic, Sept 2008 An interior-point stochastic approximation method and an L1-regularized delta rule Peter Carbonetto, Mark Schmidt and Nando de Freitas University of British

More information

Adaptive restart of accelerated gradient methods under local quadratic growth condition

Adaptive restart of accelerated gradient methods under local quadratic growth condition Adaptive restart of accelerated gradient methods under local quadratic growth condition Olivier Fercoq Zheng Qu September 6, 08 arxiv:709.0300v [math.oc] 7 Sep 07 Abstract By analyzing accelerated proximal

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Convergence Rate of Expectation-Maximization

Convergence Rate of Expectation-Maximization Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

Primal-dual coordinate descent

Primal-dual coordinate descent Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x

More information

Introduction to Three Paradigms in Machine Learning. Julien Mairal

Introduction to Three Paradigms in Machine Learning. Julien Mairal Introduction to Three Paradigms in Machine Learning Julien Mairal Inria Grenoble Yerevan, 208 Julien Mairal Introduction to Three Paradigms in Machine Learning /25 Optimization is central to machine learning

More information

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

The basic problem of interest in this paper is the convex programming (CP) problem given by

The basic problem of interest in this paper is the convex programming (CP) problem given by Noname manuscript No. (will be inserted by the editor) An optimal randomized incremental gradient method Guanghui Lan Yi Zhou the date of receipt and acceptance should be inserted later arxiv:1507.02000v3

More information

A Proximal Stochastic Gradient Method with Progressive Variance Reduction

A Proximal Stochastic Gradient Method with Progressive Variance Reduction A Proximal Stochastic Gradient Method with Progressive Variance Reduction arxiv:403.4699v [math.oc] 9 Mar 204 Lin Xiao Tong Zhang March 8, 204 Abstract We consider the problem of minimizing the sum of

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

arxiv: v2 [cs.lg] 14 Sep 2017

arxiv: v2 [cs.lg] 14 Sep 2017 Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU arxiv:160.0101v [cs.lg] 14 Sep 017

More information