Large-scale machine learning and convex optimization

Size: px
Start display at page:

Download "Large-scale machine learning and convex optimization"

Transcription

1 Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE IFCAM, Bangalore - July 2014

2 Big data revolution? A new scientific context Data everywhere: size does not (always) matter Science and industry Size and variety Learning from examples n observations in dimension d

3 Search engines - advertising

4 Search engines - Advertising

5 Marketing - Personalized recommendation

6 Visual object recognition

7 Personal photos

8 Bioinformatics Protein: Crucial elements of cell life Massive data: 2 millions for humans Complex data

9 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

10 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

11 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

12 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

13 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer

14 Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2

15 Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2 Classification : y { 1,1}, prediction ŷ = sign(θ Φ(x)) loss of the form l(yθ Φ(x)) True 0-1 loss: l(yθ Φ(x)) = 1 yθ Φ(x)<0 Usual convex losses: hinge square logistic

16 Main motivating examples Support vector machine (hinge loss) l(y,θ Φ(X)) = max{1 Yθ Φ(X),0} Logistic regression l(y,θ Φ(X)) = log(1+exp( Yθ Φ(X))) Least-squares regression l(y,θ Φ(X)) = 1 2 (Y θ Φ(X)) 2

17 Usual regularizers Main goal: avoid overfitting (squared) Euclidean norm: θ 2 2 = d j=1 θ j 2 Numerically well-behaved Representer theorem and kernel methods : θ = n i=1 α iφ(x i ) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) Sparsity-inducing norms Main example: l 1 -norm θ 1 = d j=1 θ j Perform model selection as well as regularization Non-smooth optimization and structured sparsity See, e.g., Bach, Jenatton, Mairal, and Obozinski (2011, 2012)

18 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer

19 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

20 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

21 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) such that Ω(θ) D θ R d n i=1 convex data fitting term + constraint Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

22 General assumptions Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Bounded features Φ(x) R d : Φ(x) 2 R Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Loss for a single observation: f i (θ) = l(y i,θ Φ(x i )) i, f(θ) = Ef i (θ) Properties of f i,f, ˆf Convex on R d Additional regularity assumptions: Lipschitz-continuity, smoothness and strong convexity

23 Lipschitz continuity Bounded gradients of f (Lipschitz-continuity): the function f if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: Machine learning θ R d, θ 2 D f (θ) 2 B with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) G-Lipschitz loss and R-bounded data: B = GR

24 Smoothness and strong convexity A function f : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 If f is twice differentiable: θ R d, f (θ) L Id smooth non smooth

25 Smoothness and strong convexity A function f : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 If f is twice differentiable: θ R d, f (θ) L Id Machine learning with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) l-smooth loss and R-bounded data: L = lr 2

26 Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If f is twice differentiable: θ R d, f (θ) µ Id convex strongly convex

27 Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If f is twice differentiable: θ R d, f (θ) µ Id Machine learning with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension)

28 Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If f is twice differentiable: θ R d, f (θ) µ Id Machine learning with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension) Adding regularization by µ 2 θ 2 creates additional bias unless µ is small

29 Summary of smoothness/convexity assumptions Bounded gradients of f (Lipschitz-continuity): the function f if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: θ R d, θ 2 D f (θ) 2 B Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient f : θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 Strong convexity of f: The function f is strongly convex with respect to the norm, with convexity constant µ > 0: θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ 2 2 2

30 Analysis of empirical risk minimization Approximation and estimation errors: C = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ C θ C θ R df(θ) NB: may replace min by best (non-linear) predictions θ Rdf(θ) 1. Uniform deviation bounds, with ˆθ argmin θ C ˆf(θ) f(ˆθ) min θ C f(θ) 2sup θ C Typically slow rate O ( 1 n ) ˆf(θ) f(θ) (proof) 2. More refined concentration results with faster rates

31 Motivation from least-squares For least-squares, we have l(y,θ Φ(x)) = 1 2 (y θ Φ(x)) 2, and sup θ 2 D f(θ) ˆf(θ) = 1 ( 1 n 2 θ Φ(x i )Φ(x i ) EΦ(X)Φ(X) )θ n i=1 ( 1 n ) θ y i Φ(x i ) EYΦ(X) + 1 ( 1 n yi 2 EY ), 2 n 2 n i=1 i=1 1 n i )Φ(x i ) n i=1φ(x EΦ(X)Φ(X) op +D 1 n y i Φ(x i ) EYΦ(X) n n 2 2 yi 2 EY 2, n f(θ) ˆf(θ) D2 2 i=1 sup f(θ) ˆf(θ) O(1/ n) with high probability θ 2 D i=1

32 Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} No assumptions regarding convexity

33 Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup θ C ˆf(θ) f(θ) GRD n [2+ Expectated estimation error: E [ sup θ C 2log 2 δ ] ˆf(θ) f(θ) ] 4GRD n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate

34 Symmetrization with Rademacher variables Let D = {x 1,y 1,...,x n,y n} an independent copy of the data D = {x 1,y 1,...,x n,y n }, with corresponding loss functions f i (θ) E [ sup f(θ) ˆf(θ) ] = E [ ( sup f(θ) 1 n ) ] f i (θ) θ Θ θ Θ n [ i=1 = E sup 1 n E ( f θ Θ n i(θ) f i (θ) D ) ] i=1 [ [ E E sup 1 n ]] ( f θ Θ n i (θ) f i (θ) D i=1 [ = E sup 1 n ( f θ Θ n i (θ) f i (θ) ) ] i=1 [ = E sup 1 n ( ε i f θ Θ n i (θ) f i (θ) ) ] with ε i uniform in { 1,1} i=1 [ 2E 1 n ] ε i f i (θ) n = Rademacher complexity sup θ Θ i=1

35 Rademacher complexity DefinetheRademachercomplexity of theclass of functions(x,y) l(y,θ Φ(X)) as [ R n = E sup θ Θ 1 n n ] ε i f i (θ). Note two expectations, with respect to D and with respect to ε i=1 Main property: E [ ] sup f(θ) ˆf(θ) 2Rn θ Θ

36 From Rademacher complexity to uniform bound Let Z = sup θ Θ f(θ) ˆf(θ) By changing the pair (x i,y i ), Z may only change by 2 n sup l(y,θ Φ(X)) 2 ( ) 2( sup l(y,0) +GRD l0 +GRD ) = c n n with sup l(y,0) = l 0 MacDiarmid inequality: with probability greater than 1 δ, Z EZ + n 2 c log 1 δ 2R n+ 2 n ( l0 +GRD ) log 1 δ

37 Bounding the Rademacher average - I Wehave,withϕ i (u) = l(y i,u) l(y i,0)isalmostsurelyb-lipschitz: [ R n = E sup 1 θ Θ n [ E sup 1 θ Θ n l [ 0 +E n n ] ε i f i (θ) i=1 n ] ε i f i (0) i=1 sup θ Θ = l [ 0 +E sup n θ Θ 1 n 1 n [ +E sup θ Θ 1 n n [ ε i fi (θ) f i (0) ] ] i=1 n [ ε i fi (θ) f i (0) ]] i=1 n i=1 ] ε i ϕ i (θ Φ(x i )) Using Ledoux-Talagrand concentration results for Rademacher averages (since ϕ i is G-Lipschitz, we get: R n l [ 0 +2G E sup 1 n θ 2 D n n ε i θ Φ(x i )] i=1

38 We have: Bounding the Rademacher average - II R n l [ 0 +2GE sup 1 n n ε i θ Φ(x i )] θ 2 D n i=1 = l 0 n +2GE n D1 ε i Φ(x i ) n i=1 2 l 0 +2GD 1 n 2 E ε i Φ(x i ) n n 2(l 0+GRD) n i=1 2 Overall, we get, with probability 1 δ: sup f(θ) ˆf(θ) 1 ( n l0 +GRD)(4+ θ Θ 2log 1 δ )

39 Putting it all together We have, with probability 1 δ, for all θ Θ: f(θ) f(θ ) [ f(θ) ˆf(θ) ] + [ˆf(θ) min θ Θ 2 (l 0 +GRD)(4+ n ˆf(θ ) ] + [ min θ Θ 2log 1 δ )+[ˆf(θ) min θ Θ ˆf(θ ) ˆf(θ ) ] ˆf(θ ) ] Only need to optimize with precision 2 n (l 0 +GRD)

40 Slow rate for supervised learning (summary) Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup θ C ˆf(θ) f(θ) (l 0+GRD) n [2+ Expectated estimation error: E [ sup θ C 2log 2 δ ˆf(θ) f(θ) ] 4(l 0+GRD) n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate ]

41 Motivation from mean estimation Estimator ˆθ = 1 n n i=1 z i = argmin θ R 1 2n n i=1 (θ z i) 2 = ˆf(θ) From before: f(θ) = 1 2 E(θ z)2 = 1 2 (θ Ez) var(z) = ˆf(θ)+O(1/ n) f(ˆθ) = 1 2 (ˆθ Ez) var(z) = f(ez)+o(1/ n)

42 Motivation from mean estimation Estimator ˆθ = 1 n n i=1 z i = argmin θ R 1 2n n i=1 (θ z i) 2 = ˆf(θ) From before: f(θ) = 1 2 E(θ z)2 = 1 2 (θ Ez) var(z) = ˆf(θ)+O(1/ n) f(ˆθ) = 1 2 (ˆθ Ez) var(z) = f(ez)+o(1/ n) More refined/direct bound: f(ˆθ) f(ez) = 1 2 (ˆθ Ez) 2 E [ f(ˆθ) f(ez) ] = 1 2 E ( 1 n n i=1 ) 2 z i Ez = 1 2n var(z) Bound only at ˆθ + strong convexity

43 Fast rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Same as before (bounded features, Lipschitz loss) Regularized risks: f µ (θ) = f(θ)+ µ 2 θ 2 2 and ˆf µ (θ) = ˆf(θ)+ µ 2 θ 2 2 Convexity For any a > 0, with probability greater than 1 δ, for all θ R d, f µ (θ) min (η) (1+a)(ˆf µ (θ) min ˆfµ (η))+ 8(1+ 1 a )G2 R 2 (32+log 1 δ ) η R dfµ η R d µn Results from Sridharan, Srebro, and Shalev-Shwartz (2008) see also Boucheron and Massart (2011) and references therein Strongly convex functions fast rate Warning: µ should decrease with n to reduce approximation error

44 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

45 Complexity results in convex optimization Assumption: f convex on R d Classical generic algorithms (sub)gradient method/descent Accelerated gradient descent Newton method Key additional properties of f Lipschitz continuity, smoothness or strong convexity Key insight from Bottou and Bousquet (2008) In machine learning, no need to optimize below estimation error Key reference: Nesterov (2004)

46 Assumptions Subgradient method/descent f convex and B-Lipschitz-continuous on { θ 2 D} ( Algorithm: θ t = Π D θ t 1 2D ) B t f (θ t 1 ) Π D : orthogonal projection onto { θ 2 D} Bound: Three-line proof f ( 1 t t 1 k=0 θ k ) f(θ ) 2DB t Best possible convergence rate after O(d) iterations

47 Subgradient method/descent - proof - I Iteration: θ t = Π D (θ t 1 γ t f (θ t 1 )) with γ t = 2D B t Assumption: f (θ) 2 B and θ 2 D θ t θ 2 2 θ t 1 θ γ t f (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γ 2 t 2γ t (θ t 1 θ ) f (θ t 1 ) because f (θ t 1 ) 2 B θ t 1 θ 2 2+B 2 γ 2 t 2γ t [ f(θt 1 ) f(θ ) ] (property of subgradients) leading to f(θ t 1 ) f(θ ) B2 γ t [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t

48 Starting from Subgradient method/descent - proof - II t [ f(θu 1 ) f(θ ) ] u=1 f(θ t 1 ) f(θ ) B2 γ t 2 = = Using convexity: f t u=1 t u=1 ( 1 t t u=1 t u=1 B 2 γ u 2 B 2 γ u 2 B 2 γ u 2 B 2 γ u 2 t 1 k=0 + + t u=1 t 1 u=1 + θ k ) t 1 u=1 + 1 [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 1 [ θu 1 θ 2 2γ 2 θ u θ 2 ] 2 u ( θ u θ ) θ 0 θ θ t θ 2 2 2γ u+1 2γ u 2γ 1 2γ t 4D 2( 1 2γ u+1 1 2γ u ) + 4D 2 2γ 1 + 4D2 2γ t 2DB t with γ t = 2D B t f(θ ) 2DB t

49 Subgradient descent for machine learning Assumptions (f is the expected risk, ˆf the empirical risk) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. ˆf(θ) = 1 n n i=1 l(y i,φ(x i ) θ) G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} Statistics: with probability greater than 1 δ ˆf(θ) f(θ) GRD [2+ n sup θ C 2log 2 δ ] Optimization: after t iterations of subgradient method ˆf(ˆθ) min η C ˆf(η) GRD t t = n iterations, with total running-time complexity of O(n 2 d)

50 Subgradient descent - strong convexity Assumptions f convex and B-Lipschitz-continuous on { θ 2 D} f µ-strongly convex ) 2 Algorithm: θ t = Π D (θ t 1 µ(t+1) f (θ t 1 ) Bound: f ( 2 t(t+1) t k=1 kθ k 1 ) f(θ ) 2B2 µ(t+1) Three-line proof Best possible convergence rate after O(d) iterations

51 Subgradient method - strong convexity - proof - I Iteration: θ t = Π D (θ t 1 γ t f (θ t 1 )) with γ t = 2 µ(t+1) Assumption: f (θ) 2 B and θ 2 D and µ-strong convexity of f θ t θ 2 2 θ t 1 θ γ t f (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γt 2 2γ t (θ t 1 θ ) f (θ t 1 ) because f (θ t 1 ) 2 B θ t 1 θ 2 2+B 2 γt 2 [ 2γ t f(θt 1 ) f(θ )+ µ 2 θ t 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to f(θ t 1 ) f(θ ) B2 γ t B 2 µ(t+1) + µ 2 [1 γ t µ ] θ t 1 θ γ t θ t θ 2 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ 2 2 4

52 Subgradient method - strong convexity - proof - II From f(θ t 1 ) f(θ ) B2 µ(t+1) + µ 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ t u [ f(θ u 1 ) f(θ ) ] u=1 u t=1 B 2 u µ(u+1) t [ u(u 1) θu 1 θ 2 2 u(u+1) θ u θ 2 2] u=1 B2 t µ + 1 [ 0 t(t+1) θt θ 2 B 4 2] 2 t µ Using convexity: f ( 2 t(t+1) t u=1 uθ u 1 ) f(θ ) 2B2 t+1

53 (smooth) gradient descent Assumptions f convex with L-Lipschitz-continuous gradient Minimum attained at θ Algorithm: Bound: Three-line proof θ t = θ t 1 1 L f (θ t 1 ) f(θ t ) f(θ ) 2L θ 0 θ 2 t+4 Not best possible convergence rate after O(d) iterations

54 (smooth) gradient descent - strong convexity Assumptions f convex with L-Lipschitz-continuous gradient f µ-strongly convex Algorithm: Bound: θ t = θ t 1 1 L f (θ t 1 ) f(θ t ) f(θ ) (1 µ/l) t[ f(θ 0 ) f(θ ) ] Three-line proof Adaptivity of gradient descent to problem difficulty Line search

55 Accelerated gradient methods (Nesterov, 1983) Assumptions f convex with L-Lipschitz-cont. gradient, min. attained at θ Algorithm: θ t = η t 1 1 L f (η t 1 ) Bound: η t = θ t + t 1 t+2 (θ t θ t 1 ) f(θ t ) f(θ ) 2L θ 0 θ 2 (t+1) 2 Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011) Not improvable Extension to strongly convex functions

56 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2011) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t)

57 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2011) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t) Problems of the form: min θ R df(θ)+µω(θ) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+µω(θ)+ L 2 θ θ t 2 2 Ω(θ) = θ 1 Thresholded gradient descent Similar convergence rates than smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)

58 Summary: minimizing convex functions Assumption: f convex Gradient descent: θ t = θ t 1 γ t f (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate

59 Summary: minimizing convex functions Assumption: f convex Gradient descent: θ t = θ t 1 γ t f (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate Key insights from Bottou and Bousquet (2008) 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages Stochastic approximation

60 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

61 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d

62 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d Machine learning - statistics loss for a single pair of observations: f n (θ) = l(y n,θ Φ(x n )) f(θ) = Ef n (θ) = El(y n,θ Φ(x n )) = generalization error Expected gradient: f (θ) = Ef n(θ) = E { l (y n,θ Φ(x n ))Φ(x n ) } Non-asymptotic results Number of iterations = number of observations

63 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n (θ n) of its gradients f (θ n ) at certain points θ n R d Stochastic approximation (much) broader applicability beyond convex optimization θ n = θ n 1 γ n h n (θ n 1 ) with E [ h n (θ n 1 ) θ n 1 ] = h(θn 1 ) Beyond convex problems, i.i.d assumption, finite dimension, etc. Typically asymptotic results See, e.g., Kushner and Yin (2003); Borkar (2008); Benveniste et al. (2012)

64 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations

65 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results

66 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results Online learning Update ˆθ n after each new (potentially adversarial) observation z n Cumulative loss: 1 n n k=1 l(ˆθ k 1,z k ) Online to batch through averaging (Cesa-Bianchi et al., 2004)

67 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex

68 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α

69 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α Desirable practical behavior Applicable (at least) to classical supervised learning problems Robustness to (potentially unknown) constants (L,B,µ) Adaptivity to difficulty of the problem (e.g., strong convexity)

70 Assumptions Stochastic subgradient descent/method f n convex and B-Lipschitz-continuous on { θ 2 D} (f n ) i.i.d. functions such that Ef n = f θ global optimum of f on { θ 2 D} ( Algorithm: θ n = Π D θ n 1 2D ) B n f n(θ n 1 ) Bound: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Same three-line proof as in the deterministic case Minimax convergence rate Running-time complexity: O(dn) after n iterations

71 Stochastic subgradient method - proof - I Iteration: θ n = Π D (θ n 1 γ n f n(θ n 1 )) with γ n = 2D B n F n : information up to time n f n(θ) 2 B and θ 2 D, unbiased gradients/functions E(f n F n 1 ) = f θ n θ 2 2 θ n 1 θ γ n f n(θ n 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2 +B 2 γ 2 n 2γ n (θ n 1 θ ) f n(θ n 1 ) because f n(θ n 1 ) 2 B E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ ) f (θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] (subgradient property) E θ n θ 2 2 E θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ Ef(θn 1 ) f(θ ) ] leading to Ef(θ n 1 ) f(θ ) B2 γ n [ ] E θn 1 θ 2 2γ 2 E θ n θ 2 2 n

72 Stochastic subgradient method - proof - II Starting from Ef(θ n 1 ) f(θ ) B2 γ n [ ] E θn 1 θ 2 2 E θ n θ 2 2 2γ n n [ Ef(θu 1 ) f(θ ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 1 [ E θu 1 θ 2 2γ 2 E θ u θ 2 ] 2 u + 4D2 2γ n 2DB n with γ n = 2D B n Using convexity: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n

73 Stochastic subgradient descent - strong convexity - I Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f f µ-strongly convex on { θ 2 D} θ global optimum of f over { θ 2 D} ) 2 Algorithm: θ n = Π D (θ n 1 µ(n+1) f n(θ n 1 ) Bound: ( 2 Ef n(n+1) n ) kθ k 1 k=1 f(θ ) 2B2 µ(n+1) Same three-line proof than in the deterministic case Minimax convergence rate

74 Stochastic subgradient descent - strong convexity - II Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f θ global optimum of g = f + µ No compactness assumption - no projections Algorithm: 2 2 [ ] θ n = θ n 1 µ(n+1) g n(θ n 1 ) = θ n 1 f µ(n+1) n (θ n 1 )+µθ n 1 ( 2 Bound: Eg n(n+1) n ) kθ k 1 k=1 Minimax convergence rate g(θ ) 2B2 µ(n+1)

75 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

76 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

77 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Many contributions in optimization and online learning: Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

78 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems A single algorithm with global adaptive convergence rate for smooth problems?

79 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems Non-asymptotic analysis for smooth problems? problems with convergence rate O(min{1/µn,1/ n})

80 Smoothness/convexity assumptions Iteration: θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n n 1 k=0 θ k Smoothness of f n : For each n 1, the function f n is a.s. convex, differentiable with L-Lipschitz-continuous gradient f n: Smooth loss and bounded data Strong convexity of f: The function f is strongly convex with respect to the norm, with convexity constant µ > 0: Invertible population covariance matrix or regularization by µ 2 θ 2

81 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C

82 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C Convergence rates for E θ n θ 2 and E θ n θ 2 ( σ 2 γ ) n no averaging: O +O(e µnγ n ) θ 0 θ 2 µ averaging: trh(θ ) 1 ( +µ 1 O(n 2α +n 2+α θ0 θ 2 ) )+O n µ 2 n 2

83 Classical proof sketch (no averaging) θ n θ 2 2 = θ n 1 γ n f n(θ n 1 ) θ 2 2 = θ n 1 θ 2 2 2γ n (θ n 1 θ ) f n(θ n 1 )+γ 2 n f n(θ n 1 ) 2 2 θ n 1 θ 2 2 2γ n (θ n 1 θ ) f n(θ n 1 ) E [ θ n θ 2 2 F n 1 ] +2γn f 2 n(θ ) 2 2+2γn f 2 n(θ n 1 ) f n(θ ) 2 2 θ n 1 θ 2 2 2γ n (θ n 1 θ ) f n(θ n 1 ) +2γn f 2 n(θ ) 2 2+2γnL[f 2 n(θ n 1 ) f n(θ )] (θ n 1 θ ) θ n 1 θ 2 2 2γ n (θ n 1 θ ) f (θ n 1 ) +2γnE f 2 n(θ ) 2 2+2γnL[f 2 (θ n 1 ) 0] (θ n 1 θ ) θ n 1 θ 2 2 2γ n (1 γ n L)(θ n 1 θ ) f (θ n 1 )+2γnσ 2 2 θ n 1 θ 2 2 2γ n (1 γ n L) 1 2 µ θ n 1 θ 2 2+2γ 2 nσ 2 E [ θ n 1 θ 2 ] 2 = [ 1 µγ n (1 γ n L) ] θ n 1 θ 2 2+2γnσ 2 2 [ 1 µγ n (1 γ n L) ] E [ θ n 1 θ 2 ] 2 +2γ 2 n σ 2

84 Proof sketch (averaging) From Polyak and Juditsky (1992): θ n = θ n 1 γ n f n(θ n 1 ) f n(θ n 1 ) = 1 γ n (θ n 1 θ n ) f n(θ )+f n(θ )(θ n 1 θ ) = 1 γ n (θ n 1 θ n )+O( θ n 1 θ 2 ) f n(θ )+f (θ )(θ n 1 θ ) = 1 γ n (θ n 1 θ n )+O( θ n 1 θ 2 ) +O( θ n 1 θ )ε n θ n 1 θ = f (θ ) 1 f n(θ )+ 1 γ n f (θ ) 1 (θ n 1 θ n ) +O( θ n 1 θ 2 )+O( θ n 1 θ )ε n Averaging to cancel the term 1 γ n f (θ ) 1 (θ n 1 θ n )

85 Robustness to wrong constants for γ n = Cn α f(θ) = 1 2 θ 2 with i.i.d. Gaussian noise (d = 1) Left: α = 1/2 Right: α = 1 log[f(θ n ) f ] 5 0 α = 1/2 sgd C=1/5 ave C=1/5 sgd C=1 ave C=1 sgd C=5 ave C=5 log[f(θ n ) f ] 5 0 α = 1 sgd C=1/5 ave C=1/5 sgd C=1 ave C=1 sgd C=5 ave C= log(n) log(n) See also

86 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants

87 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Non-strongly convex smooth objective functions Old: O(n 1/2 ) rate achieved with averaging for α = 1/2 New: O(max{n 1/2 3α/2,n α/2,n α 1 }) rate achieved without averaging for α [1/3,1] Take-home message Use α = 1/2 with averaging to be adaptive to strong convexity

88 Beyond stochastic gradient method Adding a proximal step Goal: min θ R df(θ)+ω(θ) = Ef n(θ)+ω(θ) Replace recursion θ n = θ n 1 γ n f n(θ n ) by θ n = min θ R d θ θ n 1 +γ n f n(θ n ) 2 2 +CΩ(θ) Xiao (2010); Hu et al. (2009) May be accelerated (Ghadimi and Lan, 2013) Related frameworks Regularized dual averaging (Nesterov, 2009; Xiao, 2010) Mirror descent (Nemirovski et al., 2009; Lan et al., 2012)

89 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

90 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems A single adaptive algorithm for smooth problems with convergence rate O(min{1/µn,1/ n}) in all situations?

91 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ)

92 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) logistic loss

93 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) n steps of averaged SGD with constant step-size 1/ ( 2R 2 n ) with R = radius of data (Bach, 2013): Ef( θ n ) f(θ ) min { } 1 n, R2 (15+5R θ0 θ ) 4 nµ Proof based on self-concordance (Nesterov and Nemirovski, 1994)

94 Self-concordance Usual definition for convex ϕ : R R: ϕ (t) 2ϕ (t) 3/2 Affine invariant Extendable to all convex functions on R d by looking at rays Used for the sharp proof of quadratic convergence of Newton method (Nesterov and Nemirovski, 1994) Generalized notion: ϕ (t) ϕ (t) Applicable to logistic regression (with extensions)

95 Self-concordance Usual definition for convex ϕ : R R: ϕ (t) 2ϕ (t) 3/2 Affine invariant Extendable to all convex functions on R d by looking at rays Used for the sharp proof of quadratic convergence of Newton method (Nesterov and Nemirovski, 1994) Generalized notion: ϕ (t) ϕ (t) Applicable to logistic regression (with extensions) Important properties Allows global Taylor expansions Relates expansions of derivatives of different orders

96 Adaptive algorithm for logistic regression Proof sketch Step 1: use existing result f( θ n ) f(θ )+ R2 n θ 0 θ 2 2 = O(1/ n) Step 2: f n(θ n 1 ) = 1 γ (θ n 1 θ n ) n 1 n k=1 f k (θ k 1) = 1 nγ (θ 0 θ n ) Step 3: f ( 1 n n k=1 θ ) k 1 1 n n k=1 f (θ k 1 ) 2 = O ( f( θ n ) f(θ ) ) = O(1/ n) using self-concordance Step 4a: if f µ-strongly convex, f( θ n ) f(θ ) 1 2µ f ( θ n ) 2 2 Step 4b: if f self-concordant, locally true with µ = λ min (f (θ ))

97 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) n steps of averaged SGD with constant step-size 1/ ( 2R 2 n ) with R = radius of data (Bach, 2013): Ef( θ n ) f(θ ) min { } 1 n, R2 (15+5R θ0 θ ) 4 nµ Proof based on self-concordance (Nesterov and Nemirovski, 1994)

98 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) n steps of averaged SGD with constant step-size 1/ ( 2R 2 n ) with R = radius of data (Bach, 2013): Ef( θ n ) f(θ ) min { } 1 n, R2 (15+5R θ0 θ ) 4 nµ A single adaptive algorithm for smooth problems with convergence rate O(1/n) in all situations?

99 Least-mean-square algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ),θ ) 2] with θ R d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n ) Φ(x n ) ] = H µ Id

100 Least-mean-square algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ),θ ) 2] with θ R d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n ) Φ(x n ) ] = H µ Id New analysis for averaging and constant step-size γ = 1/(4R 2 ) Assume Φ(x n ) R and y n Φ(x n ),θ σ almost surely No assumption regarding lowest eigenvalues of H Main result: Ef( θ n 1 ) f(θ ) 4σ2 d n + 4R2 θ 0 θ 2 n Matches statistical lower bound (Tsybakov, 2003) Non-asymptotic robust version of Györfi and Walk (1996)

101 Least-squares - Proof technique LMS recursion: θ n θ = [ I γφ(x n ) Φ(x n ) ] (θ n 1 θ )+γε n Φ(x n ) Simplified LMS recursion: with H = E [ Φ(x n ) Φ(x n ) ] θ n θ = [ I γh ] (θ n 1 θ )+γε n Φ(x n ) Direct proof technique of Polyak and Juditsky (1992), e.g., θ n θ = [ I γh ] n (θ0 θ )+γ n k=1 [ I γh ] n kεk Φ(x k ) Infinite expansion of Aguech, Moulines, and Priouret(2000) in powers of γ

102 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) - For least-squares, θ γ = θ θ n θ γ θ 0

103 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n θ θ 0

104 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n θ n θ θ 0

105 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n does not converge to θ but oscillates around it oscillations of order γ Ergodic theorem: Averaged iterates converge to θ γ = θ at rate O(1/n)

106 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic square log 10 [f(θ) f(θ * )] /2R 2 1/8R 2 4 1/32R 2 1/2R 2 n 1/ log 10 (n)

107 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) log 10 [f(θ) f(θ * )] alpha square C=1 test 1/R 2 1/R 2 n 1/2 SAG log (n) alpha square C=opt test C/R 2 C/R 2 n 1/2 SAG log (n) news square C=1 test 0.2 news square C=opt test log 10 [f(θ) f(θ * )] /R 2 1/R 2 n 1/2 SAG log (n) C/R 2 C/R 2 n 1/2 SAG log 10 (n)

108 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0

109 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0 θ n oscillates around the wrong value θ γ θ θ n θ n θ γ θ 0 θ

110 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0 θ n oscillates around the wrong value θ γ θ moreover, θ θ n = O p ( γ) Ergodic theorem averaged iterates converge to θ γ θ at rate O(1/n) moreover, θ θ γ = O(γ) (Bach, 2013)

111 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic logistic 1 log 10 [f(θ) f(θ * )] 1 2 1/2R 2 3 1/8R 2 4 1/32R 2 5 1/2R 2 n 1/ log 10 (n)

112 Restoring convergence through online Newton steps Known facts 1. Averaged SGD with γ n n 1/2 leads to robust rate O(n 1/2 ) for all convex functions 2. Averaged SGD with γ n constant leads to robust rate O(n 1 ) for all convex quadratic functions 3. Newton s method squares the error at each iteration for smooth functions 4. A single step of Newton s method is equivalent to minimizing the quadratic Taylor expansion Online Newton step Rate: O((n 1/2 ) 2 +n 1 ) = O(n 1 ) Complexity: O(p) per iteration

113 Restoring convergence through online Newton steps Known facts 1. Averaged SGD with γ n n 1/2 leads to robust rate O(n 1/2 ) for all convex functions 2. Averaged SGD with γ n constant leads to robust rate O(n 1 ) for all convex quadratic functions O(n 1 ) 3. Newton s method squares the error at each iteration for smooth functions O((n 1/2 ) 2 ) 4. A single step of Newton s method is equivalent to minimizing the quadratic Taylor expansion Online Newton step Rate: O((n 1/2 ) 2 +n 1 ) = O(n 1 ) Complexity: O(p) per iteration

114 Restoring convergence through online Newton steps The Newton step for f = Ef n (θ) def = E [ l(y n, θ,φ(x n ) ) ] at θ is equivalent to minimizing the quadratic approximation g(θ) = f( θ)+ f ( θ),θ θ θ θ,f ( θ)(θ θ) = f( θ)+ Ef n( θ),θ θ θ θ,ef n( θ)(θ θ) [ = E f( θ)+ f n( θ),θ θ θ θ,f n( θ)(θ θ) ]

115 Restoring convergence through online Newton steps The Newton step for f = Ef n (θ) def = E [ l(y n, θ,φ(x n ) ) ] at θ is equivalent to minimizing the quadratic approximation g(θ) = f( θ)+ f ( θ),θ θ θ θ,f ( θ)(θ θ) = f( θ)+ Ef n( θ),θ θ θ θ,ef n( θ)(θ θ) [ = E f( θ)+ f n( θ),θ θ θ θ,f n( θ)(θ θ) ] Complexity of least-mean-square recursion for g is O(p) θ n = θ n 1 γ [ f n( θ)+f n( θ)(θ n 1 θ) ] f n( θ) = l (y n, θ,φ(x n ) )Φ(x n ) Φ(x n ) has rank one New online Newton step without computing/inverting Hessians

116 Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity

117 Logistic regression - Proof technique Using generalized self-concordance of ϕ : u log(1+e u ): ϕ (u) ϕ (u) NB: difference with regular self-concordance: ϕ (u) 2ϕ (u) 3/2 Using novel high-probability convergence results for regular averaged stochastic gradient descent Requires assumption on the kurtosis in every direction, i.e., E Φ(x n ),η 4 κ [ E Φ(x n ),η 2] 2

118 Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity Update at each iteration using the current averaged iterate Recursion: θ n = θ n 1 γ [ f n( θ n 1 )+f n( θ n 1 )(θ n 1 θ n 1 ) ] No provable convergence rate (yet) but best practical behavior Note (dis)similarity with regular SGD: θ n = θ n 1 γf n(θ n 1 )

119 Online Newton algorithm Current proof (Flammarion et al., 2014) Recursion { θn = θ n 1 γ [ f n( θ n 1 )+f n( θ n 1 )(θ n 1 θ n 1 ) ] θ n = θ n n (θ n θ n 1 ) Instance of two-time-scale stochastic approximation (Borkar, 1997) Given θ, θ n = θ n 1 γ [ f n( θ) + f n( θ)(θ n 1 θ) ] defines a homogeneous Markov chain (fast dynamics) θ n is updated at rate 1/n (slow dynamics) Difficulty: preserving robustness to ill-conditioning

120 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic logistic 1 0 synthetic logistic 2 log 10 [f(θ) f(θ * )] /2R 2 3 1/8R 2 3 every 2 p 4 1/32R 2 every iter. 4 2 step 5 1/2R 2 n 1/2 5 2 step dbl log 10 (n) log 10 (n) log 10 [f(θ) f(θ * )] 1

121 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) alpha logistic C=1 test alpha logistic C=opt test log 10 [f(θ) f(θ * )] /R /R 2 n 1/2 SAG 2 Adagrad Newton log (n) C/R C/R 2 n 1/2 SAG 2 Adagrad Newton log (n) news logistic C=1 test 0.2 news logistic C=opt test log 10 [f(θ) f(θ * )] /R /R 2 n 1/2 SAG 0.8 Adagrad 1 Newton log (n) C/R C/R 2 n 1/2 SAG 0.8 Adagrad 1 Newton log 10 (n)

122 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

123 Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y,θ Φ(x))

124 Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y,θ Φ(x)) Machine learning practice Finite data set (x 1,y 1,...,x n,y n ) Multiple passes Minimizes training cost 1 n n i=1 l(y i,θ Φ(x i )) Need to regularize (e.g., by the l 2 -norm) to avoid overfitting Goal: minimize g(θ) = 1 n n f i (θ) i=1

125 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n Linear (e.g., exponential) convergence rate in O(e αt ) Iteration complexity is linear in n (with line search) n f i(θ t 1 ) i=1

126 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n n f i(θ t 1 ) i=1

127 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n Linear (e.g., exponential) convergence rate in O(e αt ) Iteration complexity is linear in n (with line search) n f i(θ t 1 ) i=1 Stochastic gradient descent: θ t = θ t 1 γ t f i(t) (θ t 1) Sampling with replacement: i(t) random element of {1,...,n} Convergence rate in O(1/t) Iteration complexity is independent of n (step size selection?)

128 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n n f i(θ t 1 ) i=1 Stochastic gradient descent: θ t = θ t 1 γ t f i(t) (θ t 1)

129 Stochastic vs. deterministic methods Goal = best of both worlds: Linear rate with O(1) iteration cost Robustness to step size log(excess cost) stochastic deterministic time

130 Stochastic vs. deterministic methods Goal = best of both worlds: Linear rate with O(1) iteration cost Robustness to step size log(excess cost) hybrid time stochastic deterministic

131 Accelerating gradient methods - Related work Nesterov acceleration Nesterov (1983, 2004) Better linear rate but still O(n) iteration cost Hybrid methods, incremental average gradient, increasing batch size Bertsekas (1997); Blatt et al. (2008); Friedlander and Schmidt (2011) Linear rate, but iterations make full passes through the data.

132 Accelerating gradient methods - Related work Momentum, gradient/iterate averaging, stochastic version of accelerated batch gradient methods Polyak and Juditsky (1992); Tseng (1998); Sunehag et al. (2009); Ghadimi and Lan (2010); Xiao (2010) Can improve constants, but still have sublinear O(1/t) rate Constant step-size stochastic gradient (SG), accelerated SG Kesten (1958); Delyon and Juditsky (1993); Solodov (1998); Nedic and Bertsekas (2000) Linear convergence, but only up to a fixed tolerance. Stochastic methods in the dual Shalev-Shwartz and Zhang (2012) Similar linear rate but limited choice for the f i s

133 Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012) Stochastic average gradient (SAG) iteration Keep in memory the gradients of all functions f i, i = 1,...,n Random selection i(t) {1,...,n} with replacement { Iteration: θ t = θ t 1 γ n t f yi t with yi t n = i (θ t 1 ) if i = i(t) otherwise i=1 y t 1 i

Large-scale machine learning and convex optimization

Large-scale machine learning and convex optimization Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Allerton Conference - September 2015 Slides available at www.di.ens.fr/~fbach/gradsto_allerton.pdf

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Statistical machine learning and convex optimization

Statistical machine learning and convex optimization Statistical machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Mastère M2 - Paris-Sud - Spring 2018 Slides available: www.di.ens.fr/~fbach/fbach_orsay_2018.pdf

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Loucas Pillaud-Vivien

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Lecture 2 February 25th

Lecture 2 February 25th Statistical machine learning and convex optimization 06 Lecture February 5th Lecturer: Francis Bach Scribe: Guillaume Maillard, Nicolas Brosse This lecture deals with classical methods for convex optimization.

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Perturbed Proximal Gradient Algorithm

Perturbed Proximal Gradient Algorithm Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Stochastic Gradient Descent. CS 584: Big Data Analytics

Stochastic Gradient Descent. CS 584: Big Data Analytics Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration

More information

Stochastic Optimization Part I: Convex analysis and online stochastic optimization

Stochastic Optimization Part I: Convex analysis and online stochastic optimization Stochastic Optimization Part I: Convex analysis and online stochastic optimization Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Minimizing finite sums with the stochastic average gradient

Minimizing finite sums with the stochastic average gradient Math. Program., Ser. A (2017) 162:83 112 DOI 10.1007/s10107-016-1030-6 FULL LENGTH PAPER Minimizing finite sums with the stochastic average gradient Mark Schmidt 1 Nicolas Le Roux 2 Francis Bach 3 Received:

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

Asymptotic and finite-sample properties of estimators based on stochastic gradients

Asymptotic and finite-sample properties of estimators based on stochastic gradients Asymptotic and finite-sample properties of estimators based on stochastic gradients Panagiotis (Panos) Toulis panos.toulis@chicagobooth.edu Econometrics and Statistics University of Chicago, Booth School

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Estimators based on non-convex programs: Statistical and computational guarantees

Estimators based on non-convex programs: Statistical and computational guarantees Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Optimized first-order minimization methods

Optimized first-order minimization methods Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure

More information

Simple Optimization, Bigger Models, and Faster Learning. Niao He

Simple Optimization, Bigger Models, and Faster Learning. Niao He Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information

Stochastic Optimization: First order method

Stochastic Optimization: First order method Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem

Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem ABSRAC Yang Mu University of Massachusetts Boston 100 Morrissey Boulevard Boston, MA, US 015 yangmu@cs.umb.edu ianyi Zhou University

More information

On the Convergence Rate of Incremental Aggregated Gradient Algorithms

On the Convergence Rate of Incremental Aggregated Gradient Algorithms On the Convergence Rate of Incremental Aggregated Gradient Algorithms M. Gürbüzbalaban, A. Ozdaglar, P. Parrilo June 5, 2015 Abstract Motivated by applications to distributed optimization over networks

More information

Ergodic Subgradient Descent

Ergodic Subgradient Descent Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Mirror Descent for Metric Learning. Gautam Kunapuli Jude W. Shavlik

Mirror Descent for Metric Learning. Gautam Kunapuli Jude W. Shavlik Mirror Descent for Metric Learning Gautam Kunapuli Jude W. Shavlik And what do we have here? We have a metric learning algorithm that uses composite mirror descent (COMID): Unifying framework for metric

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Statistical Sparse Online Regression: A Diffusion Approximation Perspective

Statistical Sparse Online Regression: A Diffusion Approximation Perspective Statistical Sparse Online Regression: A Diffusion Approximation Perspective Jianqing Fan Wenyan Gong Chris Junchi Li Qiang Sun Princeton University Princeton University Princeton University University

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function

More information

Optimistic Rates Nati Srebro

Optimistic Rates Nati Srebro Optimistic Rates Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function

More information

Why should you care about the solution strategies?

Why should you care about the solution strategies? Optimization Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms makes you more effectively choose which algorithm to run Understanding the

More information

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal Sub-Sampled Newton Methods for Machine Learning Jorge Nocedal Northwestern University Goldman Lecture, Sept 2016 1 Collaborators Raghu Bollapragada Northwestern University Richard Byrd University of Colorado

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

Non-asymptotic bound for stochastic averaging

Non-asymptotic bound for stochastic averaging Non-asymptotic bound for stochastic averaging S. Gadat and F. Panloup Toulouse School of Economics PGMO Days, November, 14, 2017 I - Introduction I - 1 Motivations I - 2 Optimization I - 3 Stochastic Optimization

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés

dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés Inférence pénalisée dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés Gersende Fort Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier Toulouse,

More information

Non-Smooth, Non-Finite, and Non-Convex Optimization

Non-Smooth, Non-Finite, and Non-Convex Optimization Non-Smooth, Non-Finite, and Non-Convex Optimization Deep Learning Summer School Mark Schmidt University of British Columbia August 2015 Complex-Step Derivative Using complex number to compute directional

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information