Large-scale machine learning and convex optimization

Size: px

Start display at page:

Download "Large-scale machine learning and convex optimization"

Shonda Price
6 years ago
Views:

1 Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE IFCAM, Bangalore - July 2014

2 Big data revolution? A new scientific context Data everywhere: size does not (always) matter Science and industry Size and variety Learning from examples n observations in dimension d

3 Search engines - advertising

4 Search engines - Advertising

5 Marketing - Personalized recommendation

6 Visual object recognition

7 Personal photos

8 Bioinformatics Protein: Crucial elements of cell life Massive data: 2 millions for humans Complex data

9 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

10 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

11 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

12 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

13 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer

14 Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2

15 Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2 Classification : y { 1,1}, prediction ŷ = sign(θ Φ(x)) loss of the form l(yθ Φ(x)) True 0-1 loss: l(yθ Φ(x)) = 1 yθ Φ(x)<0 Usual convex losses: hinge square logistic

16 Main motivating examples Support vector machine (hinge loss) l(y,θ Φ(X)) = max{1 Yθ Φ(X),0} Logistic regression l(y,θ Φ(X)) = log(1+exp( Yθ Φ(X))) Least-squares regression l(y,θ Φ(X)) = 1 2 (Y θ Φ(X)) 2

17 Usual regularizers Main goal: avoid overfitting (squared) Euclidean norm: θ 2 2 = d j=1 θ j 2 Numerically well-behaved Representer theorem and kernel methods : θ = n i=1 α iφ(x i ) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) Sparsity-inducing norms Main example: l 1 -norm θ 1 = d j=1 θ j Perform model selection as well as regularization Non-smooth optimization and structured sparsity See, e.g., Bach, Jenatton, Mairal, and Obozinski (2011, 2012)

18 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer

19 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

20 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

21 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) such that Ω(θ) D θ R d n i=1 convex data fitting term + constraint Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

22 General assumptions Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Bounded features Φ(x) R d : Φ(x) 2 R Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Loss for a single observation: f i (θ) = l(y i,θ Φ(x i )) i, f(θ) = Ef i (θ) Properties of f i,f, ˆf Convex on R d Additional regularity assumptions: Lipschitz-continuity, smoothness and strong convexity

23 Lipschitz continuity Bounded gradients of f (Lipschitz-continuity): the function f if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: Machine learning θ R d, θ 2 D f (θ) 2 B with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) G-Lipschitz loss and R-bounded data: B = GR

24 Smoothness and strong convexity A function f : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 If f is twice differentiable: θ R d, f (θ) L Id smooth non smooth

25 Smoothness and strong convexity A function f : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 If f is twice differentiable: θ R d, f (θ) L Id Machine learning with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) l-smooth loss and R-bounded data: L = lr 2

26 Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If f is twice differentiable: θ R d, f (θ) µ Id convex strongly convex

27 Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If f is twice differentiable: θ R d, f (θ) µ Id Machine learning with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension)

28 Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If f is twice differentiable: θ R d, f (θ) µ Id Machine learning with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension) Adding regularization by µ 2 θ 2 creates additional bias unless µ is small

29 Summary of smoothness/convexity assumptions Bounded gradients of f (Lipschitz-continuity): the function f if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: θ R d, θ 2 D f (θ) 2 B Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient f : θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 Strong convexity of f: The function f is strongly convex with respect to the norm, with convexity constant µ > 0: θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ 2 2 2

30 Analysis of empirical risk minimization Approximation and estimation errors: C = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ C θ C θ R df(θ) NB: may replace min by best (non-linear) predictions θ Rdf(θ) 1. Uniform deviation bounds, with ˆθ argmin θ C ˆf(θ) f(ˆθ) min θ C f(θ) 2sup θ C Typically slow rate O ( 1 n ) ˆf(θ) f(θ) (proof) 2. More refined concentration results with faster rates

31 Motivation from least-squares For least-squares, we have l(y,θ Φ(x)) = 1 2 (y θ Φ(x)) 2, and sup θ 2 D f(θ) ˆf(θ) = 1 ( 1 n 2 θ Φ(x i )Φ(x i ) EΦ(X)Φ(X) )θ n i=1 ( 1 n ) θ y i Φ(x i ) EYΦ(X) + 1 ( 1 n yi 2 EY ), 2 n 2 n i=1 i=1 1 n i )Φ(x i ) n i=1φ(x EΦ(X)Φ(X) op +D 1 n y i Φ(x i ) EYΦ(X) n n 2 2 yi 2 EY 2, n f(θ) ˆf(θ) D2 2 i=1 sup f(θ) ˆf(θ) O(1/ n) with high probability θ 2 D i=1

32 Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} No assumptions regarding convexity

33 Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup θ C ˆf(θ) f(θ) GRD n [2+ Expectated estimation error: E [ sup θ C 2log 2 δ ] ˆf(θ) f(θ) ] 4GRD n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate

34 Symmetrization with Rademacher variables Let D = {x 1,y 1,...,x n,y n} an independent copy of the data D = {x 1,y 1,...,x n,y n }, with corresponding loss functions f i (θ) E [ sup f(θ) ˆf(θ) ] = E [ ( sup f(θ) 1 n ) ] f i (θ) θ Θ θ Θ n [ i=1 = E sup 1 n E ( f θ Θ n i(θ) f i (θ) D ) ] i=1 [ [ E E sup 1 n ]] ( f θ Θ n i (θ) f i (θ) D i=1 [ = E sup 1 n ( f θ Θ n i (θ) f i (θ) ) ] i=1 [ = E sup 1 n ( ε i f θ Θ n i (θ) f i (θ) ) ] with ε i uniform in { 1,1} i=1 [ 2E 1 n ] ε i f i (θ) n = Rademacher complexity sup θ Θ i=1

35 Rademacher complexity DefinetheRademachercomplexity of theclass of functions(x,y) l(y,θ Φ(X)) as [ R n = E sup θ Θ 1 n n ] ε i f i (θ). Note two expectations, with respect to D and with respect to ε i=1 Main property: E [ ] sup f(θ) ˆf(θ) 2Rn θ Θ

36 From Rademacher complexity to uniform bound Let Z = sup θ Θ f(θ) ˆf(θ) By changing the pair (x i,y i ), Z may only change by 2 n sup l(y,θ Φ(X)) 2 ( ) 2( sup l(y,0) +GRD l0 +GRD ) = c n n with sup l(y,0) = l 0 MacDiarmid inequality: with probability greater than 1 δ, Z EZ + n 2 c log 1 δ 2R n+ 2 n ( l0 +GRD ) log 1 δ

37 Bounding the Rademacher average - I Wehave,withϕ i (u) = l(y i,u) l(y i,0)isalmostsurelyb-lipschitz: [ R n = E sup 1 θ Θ n [ E sup 1 θ Θ n l [ 0 +E n n ] ε i f i (θ) i=1 n ] ε i f i (0) i=1 sup θ Θ = l [ 0 +E sup n θ Θ 1 n 1 n [ +E sup θ Θ 1 n n [ ε i fi (θ) f i (0) ] ] i=1 n [ ε i fi (θ) f i (0) ]] i=1 n i=1 ] ε i ϕ i (θ Φ(x i )) Using Ledoux-Talagrand concentration results for Rademacher averages (since ϕ i is G-Lipschitz, we get: R n l [ 0 +2G E sup 1 n θ 2 D n n ε i θ Φ(x i )] i=1

38 We have: Bounding the Rademacher average - II R n l [ 0 +2GE sup 1 n n ε i θ Φ(x i )] θ 2 D n i=1 = l 0 n +2GE n D1 ε i Φ(x i ) n i=1 2 l 0 +2GD 1 n 2 E ε i Φ(x i ) n n 2(l 0+GRD) n i=1 2 Overall, we get, with probability 1 δ: sup f(θ) ˆf(θ) 1 ( n l0 +GRD)(4+ θ Θ 2log 1 δ )

39 Putting it all together We have, with probability 1 δ, for all θ Θ: f(θ) f(θ ) [ f(θ) ˆf(θ) ] + [ˆf(θ) min θ Θ 2 (l 0 +GRD)(4+ n ˆf(θ ) ] + [ min θ Θ 2log 1 δ )+[ˆf(θ) min θ Θ ˆf(θ ) ˆf(θ ) ] ˆf(θ ) ] Only need to optimize with precision 2 n (l 0 +GRD)

40 Slow rate for supervised learning (summary) Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup θ C ˆf(θ) f(θ) (l 0+GRD) n [2+ Expectated estimation error: E [ sup θ C 2log 2 δ ˆf(θ) f(θ) ] 4(l 0+GRD) n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate ]

41 Motivation from mean estimation Estimator ˆθ = 1 n n i=1 z i = argmin θ R 1 2n n i=1 (θ z i) 2 = ˆf(θ) From before: f(θ) = 1 2 E(θ z)2 = 1 2 (θ Ez) var(z) = ˆf(θ)+O(1/ n) f(ˆθ) = 1 2 (ˆθ Ez) var(z) = f(ez)+o(1/ n)

42 Motivation from mean estimation Estimator ˆθ = 1 n n i=1 z i = argmin θ R 1 2n n i=1 (θ z i) 2 = ˆf(θ) From before: f(θ) = 1 2 E(θ z)2 = 1 2 (θ Ez) var(z) = ˆf(θ)+O(1/ n) f(ˆθ) = 1 2 (ˆθ Ez) var(z) = f(ez)+o(1/ n) More refined/direct bound: f(ˆθ) f(ez) = 1 2 (ˆθ Ez) 2 E [ f(ˆθ) f(ez) ] = 1 2 E ( 1 n n i=1 ) 2 z i Ez = 1 2n var(z) Bound only at ˆθ + strong convexity

43 Fast rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Same as before (bounded features, Lipschitz loss) Regularized risks: f µ (θ) = f(θ)+ µ 2 θ 2 2 and ˆf µ (θ) = ˆf(θ)+ µ 2 θ 2 2 Convexity For any a > 0, with probability greater than 1 δ, for all θ R d, f µ (θ) min (η) (1+a)(ˆf µ (θ) min ˆfµ (η))+ 8(1+ 1 a )G2 R 2 (32+log 1 δ ) η R dfµ η R d µn Results from Sridharan, Srebro, and Shalev-Shwartz (2008) see also Boucheron and Massart (2011) and references therein Strongly convex functions fast rate Warning: µ should decrease with n to reduce approximation error

44 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

45 Complexity results in convex optimization Assumption: f convex on R d Classical generic algorithms (sub)gradient method/descent Accelerated gradient descent Newton method Key additional properties of f Lipschitz continuity, smoothness or strong convexity Key insight from Bottou and Bousquet (2008) In machine learning, no need to optimize below estimation error Key reference: Nesterov (2004)

46 Assumptions Subgradient method/descent f convex and B-Lipschitz-continuous on { θ 2 D} ( Algorithm: θ t = Π D θ t 1 2D ) B t f (θ t 1 ) Π D : orthogonal projection onto { θ 2 D} Bound: Three-line proof f ( 1 t t 1 k=0 θ k ) f(θ ) 2DB t Best possible convergence rate after O(d) iterations

47 Subgradient method/descent - proof - I Iteration: θ t = Π D (θ t 1 γ t f (θ t 1 )) with γ t = 2D B t Assumption: f (θ) 2 B and θ 2 D θ t θ 2 2 θ t 1 θ γ t f (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γ 2 t 2γ t (θ t 1 θ ) f (θ t 1 ) because f (θ t 1 ) 2 B θ t 1 θ 2 2+B 2 γ 2 t 2γ t [ f(θt 1 ) f(θ ) ] (property of subgradients) leading to f(θ t 1 ) f(θ ) B2 γ t [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t

48 Starting from Subgradient method/descent - proof - II t [ f(θu 1 ) f(θ ) ] u=1 f(θ t 1 ) f(θ ) B2 γ t 2 = = Using convexity: f t u=1 t u=1 ( 1 t t u=1 t u=1 B 2 γ u 2 B 2 γ u 2 B 2 γ u 2 B 2 γ u 2 t 1 k=0 + + t u=1 t 1 u=1 + θ k ) t 1 u=1 + 1 [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 1 [ θu 1 θ 2 2γ 2 θ u θ 2 ] 2 u ( θ u θ ) θ 0 θ θ t θ 2 2 2γ u+1 2γ u 2γ 1 2γ t 4D 2( 1 2γ u+1 1 2γ u ) + 4D 2 2γ 1 + 4D2 2γ t 2DB t with γ t = 2D B t f(θ ) 2DB t

49 Subgradient descent for machine learning Assumptions (f is the expected risk, ˆf the empirical risk) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. ˆf(θ) = 1 n n i=1 l(y i,φ(x i ) θ) G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} Statistics: with probability greater than 1 δ ˆf(θ) f(θ) GRD [2+ n sup θ C 2log 2 δ ] Optimization: after t iterations of subgradient method ˆf(ˆθ) min η C ˆf(η) GRD t t = n iterations, with total running-time complexity of O(n 2 d)

50 Subgradient descent - strong convexity Assumptions f convex and B-Lipschitz-continuous on { θ 2 D} f µ-strongly convex ) 2 Algorithm: θ t = Π D (θ t 1 µ(t+1) f (θ t 1 ) Bound: f ( 2 t(t+1) t k=1 kθ k 1 ) f(θ ) 2B2 µ(t+1) Three-line proof Best possible convergence rate after O(d) iterations

51 Subgradient method - strong convexity - proof - I Iteration: θ t = Π D (θ t 1 γ t f (θ t 1 )) with γ t = 2 µ(t+1) Assumption: f (θ) 2 B and θ 2 D and µ-strong convexity of f θ t θ 2 2 θ t 1 θ γ t f (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γt 2 2γ t (θ t 1 θ ) f (θ t 1 ) because f (θ t 1 ) 2 B θ t 1 θ 2 2+B 2 γt 2 [ 2γ t f(θt 1 ) f(θ )+ µ 2 θ t 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to f(θ t 1 ) f(θ ) B2 γ t B 2 µ(t+1) + µ 2 [1 γ t µ ] θ t 1 θ γ t θ t θ 2 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ 2 2 4

52 Subgradient method - strong convexity - proof - II From f(θ t 1 ) f(θ ) B2 µ(t+1) + µ 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ t u [ f(θ u 1 ) f(θ ) ] u=1 u t=1 B 2 u µ(u+1) t [ u(u 1) θu 1 θ 2 2 u(u+1) θ u θ 2 2] u=1 B2 t µ + 1 [ 0 t(t+1) θt θ 2 B 4 2] 2 t µ Using convexity: f ( 2 t(t+1) t u=1 uθ u 1 ) f(θ ) 2B2 t+1

53 (smooth) gradient descent Assumptions f convex with L-Lipschitz-continuous gradient Minimum attained at θ Algorithm: Bound: Three-line proof θ t = θ t 1 1 L f (θ t 1 ) f(θ t ) f(θ ) 2L θ 0 θ 2 t+4 Not best possible convergence rate after O(d) iterations

54 (smooth) gradient descent - strong convexity Assumptions f convex with L-Lipschitz-continuous gradient f µ-strongly convex Algorithm: Bound: θ t = θ t 1 1 L f (θ t 1 ) f(θ t ) f(θ ) (1 µ/l) t[ f(θ 0 ) f(θ ) ] Three-line proof Adaptivity of gradient descent to problem difficulty Line search

55 Accelerated gradient methods (Nesterov, 1983) Assumptions f convex with L-Lipschitz-cont. gradient, min. attained at θ Algorithm: θ t = η t 1 1 L f (η t 1 ) Bound: η t = θ t + t 1 t+2 (θ t θ t 1 ) f(θ t ) f(θ ) 2L θ 0 θ 2 (t+1) 2 Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011) Not improvable Extension to strongly convex functions

56 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2011) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t)

57 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2011) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t) Problems of the form: min θ R df(θ)+µω(θ) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+µω(θ)+ L 2 θ θ t 2 2 Ω(θ) = θ 1 Thresholded gradient descent Similar convergence rates than smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)

58 Summary: minimizing convex functions Assumption: f convex Gradient descent: θ t = θ t 1 γ t f (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate

59 Summary: minimizing convex functions Assumption: f convex Gradient descent: θ t = θ t 1 γ t f (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate Key insights from Bottou and Bousquet (2008) 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages Stochastic approximation

60 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

61 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d

62 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d Machine learning - statistics loss for a single pair of observations: f n (θ) = l(y n,θ Φ(x n )) f(θ) = Ef n (θ) = El(y n,θ Φ(x n )) = generalization error Expected gradient: f (θ) = Ef n(θ) = E { l (y n,θ Φ(x n ))Φ(x n ) } Non-asymptotic results Number of iterations = number of observations

63 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n (θ n) of its gradients f (θ n ) at certain points θ n R d Stochastic approximation (much) broader applicability beyond convex optimization θ n = θ n 1 γ n h n (θ n 1 ) with E [ h n (θ n 1 ) θ n 1 ] = h(θn 1 ) Beyond convex problems, i.i.d assumption, finite dimension, etc. Typically asymptotic results See, e.g., Kushner and Yin (2003); Borkar (2008); Benveniste et al. (2012)

64 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations

65 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results

66 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results Online learning Update ˆθ n after each new (potentially adversarial) observation z n Cumulative loss: 1 n n k=1 l(ˆθ k 1,z k ) Online to batch through averaging (Cesa-Bianchi et al., 2004)

67 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex

68 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α

69 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α Desirable practical behavior Applicable (at least) to classical supervised learning problems Robustness to (potentially unknown) constants (L,B,µ) Adaptivity to difficulty of the problem (e.g., strong convexity)

70 Assumptions Stochastic subgradient descent/method f n convex and B-Lipschitz-continuous on { θ 2 D} (f n ) i.i.d. functions such that Ef n = f θ global optimum of f on { θ 2 D} ( Algorithm: θ n = Π D θ n 1 2D ) B n f n(θ n 1 ) Bound: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Same three-line proof as in the deterministic case Minimax convergence rate Running-time complexity: O(dn) after n iterations

71 Stochastic subgradient method - proof - I Iteration: θ n = Π D (θ n 1 γ n f n(θ n 1 )) with γ n = 2D B n F n : information up to time n f n(θ) 2 B and θ 2 D, unbiased gradients/functions E(f n F n 1 ) = f θ n θ 2 2 θ n 1 θ γ n f n(θ n 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2 +B 2 γ 2 n 2γ n (θ n 1 θ ) f n(θ n 1 ) because f n(θ n 1 ) 2 B E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ ) f (θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] (subgradient property) E θ n θ 2 2 E θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ Ef(θn 1 ) f(θ ) ] leading to Ef(θ n 1 ) f(θ ) B2 γ n [ ] E θn 1 θ 2 2γ 2 E θ n θ 2 2 n

72 Stochastic subgradient method - proof - II Starting from Ef(θ n 1 ) f(θ ) B2 γ n [ ] E θn 1 θ 2 2 E θ n θ 2 2 2γ n n [ Ef(θu 1 ) f(θ ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 1 [ E θu 1 θ 2 2γ 2 E θ u θ 2 ] 2 u + 4D2 2γ n 2DB n with γ n = 2D B n Using convexity: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n

73 Stochastic subgradient descent - strong convexity - I Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f f µ-strongly convex on { θ 2 D} θ global optimum of f over { θ 2 D} ) 2 Algorithm: θ n = Π D (θ n 1 µ(n+1) f n(θ n 1 ) Bound: ( 2 Ef n(n+1) n ) kθ k 1 k=1 f(θ ) 2B2 µ(n+1) Same three-line proof than in the deterministic case Minimax convergence rate

74 Stochastic subgradient descent - strong convexity - II Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f θ global optimum of g = f + µ No compactness assumption - no projections Algorithm: 2 2 [ ] θ n = θ n 1 µ(n+1) g n(θ n 1 ) = θ n 1 f µ(n+1) n (θ n 1 )+µθ n 1 ( 2 Bound: Eg n(n+1) n ) kθ k 1 k=1 Minimax convergence rate g(θ ) 2B2 µ(n+1)

75 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

76 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

77 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Many contributions in optimization and online learning: Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

78 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems A single algorithm with global adaptive convergence rate for smooth problems?

79 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems Non-asymptotic analysis for smooth problems? problems with convergence rate O(min{1/µn,1/ n})

80 Smoothness/convexity assumptions Iteration: θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n n 1 k=0 θ k Smoothness of f n : For each n 1, the function f n is a.s. convex, differentiable with L-Lipschitz-continuous gradient f n: Smooth loss and bounded data Strong convexity of f: The function f is strongly convex with respect to the norm, with convexity constant µ > 0: Invertible population covariance matrix or regularization by µ 2 θ 2

81 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C

82 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C Convergence rates for E θ n θ 2 and E θ n θ 2 ( σ 2 γ ) n no averaging: O +O(e µnγ n ) θ 0 θ 2 µ averaging: trh(θ ) 1 ( +µ 1 O(n 2α +n 2+α θ0 θ 2 ) )+O n µ 2 n 2

83 Classical proof sketch (no averaging) θ n θ 2 2 = θ n 1 γ n f n(θ n 1 ) θ 2 2 = θ n 1 θ 2 2 2γ n (θ n 1 θ ) f n(θ n 1 )+γ 2 n f n(θ n 1 ) 2 2 θ n 1 θ 2 2 2γ n (θ n 1 θ ) f n(θ n 1 ) E [ θ n θ 2 2 F n 1 ] +2γn f 2 n(θ ) 2 2+2γn f 2 n(θ n 1 ) f n(θ ) 2 2 θ n 1 θ 2 2 2γ n (θ n 1 θ ) f n(θ n 1 ) +2γn f 2 n(θ ) 2 2+2γnL[f 2 n(θ n 1 ) f n(θ )] (θ n 1 θ ) θ n 1 θ 2 2 2γ n (θ n 1 θ ) f (θ n 1 ) +2γnE f 2 n(θ ) 2 2+2γnL[f 2 (θ n 1 ) 0] (θ n 1 θ ) θ n 1 θ 2 2 2γ n (1 γ n L)(θ n 1 θ ) f (θ n 1 )+2γnσ 2 2 θ n 1 θ 2 2 2γ n (1 γ n L) 1 2 µ θ n 1 θ 2 2+2γ 2 nσ 2 E [ θ n 1 θ 2 ] 2 = [ 1 µγ n (1 γ n L) ] θ n 1 θ 2 2+2γnσ 2 2 [ 1 µγ n (1 γ n L) ] E [ θ n 1 θ 2 ] 2 +2γ 2 n σ 2

84 Proof sketch (averaging) From Polyak and Juditsky (1992): θ n = θ n 1 γ n f n(θ n 1 ) f n(θ n 1 ) = 1 γ n (θ n 1 θ n ) f n(θ )+f n(θ )(θ n 1 θ ) = 1 γ n (θ n 1 θ n )+O( θ n 1 θ 2 ) f n(θ )+f (θ )(θ n 1 θ ) = 1 γ n (θ n 1 θ n )+O( θ n 1 θ 2 ) +O( θ n 1 θ )ε n θ n 1 θ = f (θ ) 1 f n(θ )+ 1 γ n f (θ ) 1 (θ n 1 θ n ) +O( θ n 1 θ 2 )+O( θ n 1 θ )ε n Averaging to cancel the term 1 γ n f (θ ) 1 (θ n 1 θ n )

85 Robustness to wrong constants for γ n = Cn α f(θ) = 1 2 θ 2 with i.i.d. Gaussian noise (d = 1) Left: α = 1/2 Right: α = 1 log[f(θ n ) f ] 5 0 α = 1/2 sgd C=1/5 ave C=1/5 sgd C=1 ave C=1 sgd C=5 ave C=5 log[f(θ n ) f ] 5 0 α = 1 sgd C=1/5 ave C=1/5 sgd C=1 ave C=1 sgd C=5 ave C= log(n) log(n) See also

86 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants

87 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Non-strongly convex smooth objective functions Old: O(n 1/2 ) rate achieved with averaging for α = 1/2 New: O(max{n 1/2 3α/2,n α/2,n α 1 }) rate achieved without averaging for α [1/3,1] Take-home message Use α = 1/2 with averaging to be adaptive to strong convexity

88 Beyond stochastic gradient method Adding a proximal step Goal: min θ R df(θ)+ω(θ) = Ef n(θ)+ω(θ) Replace recursion θ n = θ n 1 γ n f n(θ n ) by θ n = min θ R d θ θ n 1 +γ n f n(θ n ) 2 2 +CΩ(θ) Xiao (2010); Hu et al. (2009) May be accelerated (Ghadimi and Lan, 2013) Related frameworks Regularized dual averaging (Nesterov, 2009; Xiao, 2010) Mirror descent (Nemirovski et al., 2009; Lan et al., 2012)

89 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

90 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems A single adaptive algorithm for smooth problems with convergence rate O(min{1/µn,1/ n}) in all situations?

91 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ)

92 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) logistic loss

93 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) n steps of averaged SGD with constant step-size 1/ ( 2R 2 n ) with R = radius of data (Bach, 2013): Ef( θ n ) f(θ ) min { } 1 n, R2 (15+5R θ0 θ ) 4 nµ Proof based on self-concordance (Nesterov and Nemirovski, 1994)

94 Self-concordance Usual definition for convex ϕ : R R: ϕ (t) 2ϕ (t) 3/2 Affine invariant Extendable to all convex functions on R d by looking at rays Used for the sharp proof of quadratic convergence of Newton method (Nesterov and Nemirovski, 1994) Generalized notion: ϕ (t) ϕ (t) Applicable to logistic regression (with extensions)

95 Self-concordance Usual definition for convex ϕ : R R: ϕ (t) 2ϕ (t) 3/2 Affine invariant Extendable to all convex functions on R d by looking at rays Used for the sharp proof of quadratic convergence of Newton method (Nesterov and Nemirovski, 1994) Generalized notion: ϕ (t) ϕ (t) Applicable to logistic regression (with extensions) Important properties Allows global Taylor expansions Relates expansions of derivatives of different orders

96 Adaptive algorithm for logistic regression Proof sketch Step 1: use existing result f( θ n ) f(θ )+ R2 n θ 0 θ 2 2 = O(1/ n) Step 2: f n(θ n 1 ) = 1 γ (θ n 1 θ n ) n 1 n k=1 f k (θ k 1) = 1 nγ (θ 0 θ n ) Step 3: f ( 1 n n k=1 θ ) k 1 1 n n k=1 f (θ k 1 ) 2 = O ( f( θ n ) f(θ ) ) = O(1/ n) using self-concordance Step 4a: if f µ-strongly convex, f( θ n ) f(θ ) 1 2µ f ( θ n ) 2 2 Step 4b: if f self-concordant, locally true with µ = λ min (f (θ ))

97 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) n steps of averaged SGD with constant step-size 1/ ( 2R 2 n ) with R = radius of data (Bach, 2013): Ef( θ n ) f(θ ) min { } 1 n, R2 (15+5R θ0 θ ) 4 nµ Proof based on self-concordance (Nesterov and Nemirovski, 1994)

98 Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) n steps of averaged SGD with constant step-size 1/ ( 2R 2 n ) with R = radius of data (Bach, 2013): Ef( θ n ) f(θ ) min { } 1 n, R2 (15+5R θ0 θ ) 4 nµ A single adaptive algorithm for smooth problems with convergence rate O(1/n) in all situations?

99 Least-mean-square algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ),θ ) 2] with θ R d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n ) Φ(x n ) ] = H µ Id

100 Least-mean-square algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ),θ ) 2] with θ R d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n ) Φ(x n ) ] = H µ Id New analysis for averaging and constant step-size γ = 1/(4R 2 ) Assume Φ(x n ) R and y n Φ(x n ),θ σ almost surely No assumption regarding lowest eigenvalues of H Main result: Ef( θ n 1 ) f(θ ) 4σ2 d n + 4R2 θ 0 θ 2 n Matches statistical lower bound (Tsybakov, 2003) Non-asymptotic robust version of Györfi and Walk (1996)

101 Least-squares - Proof technique LMS recursion: θ n θ = [ I γφ(x n ) Φ(x n ) ] (θ n 1 θ )+γε n Φ(x n ) Simplified LMS recursion: with H = E [ Φ(x n ) Φ(x n ) ] θ n θ = [ I γh ] (θ n 1 θ )+γε n Φ(x n ) Direct proof technique of Polyak and Juditsky (1992), e.g., θ n θ = [ I γh ] n (θ0 θ )+γ n k=1 [ I γh ] n kεk Φ(x k ) Infinite expansion of Aguech, Moulines, and Priouret(2000) in powers of γ

102 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) - For least-squares, θ γ = θ θ n θ γ θ 0

103 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n θ θ 0

104 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n θ n θ θ 0

105 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n does not converge to θ but oscillates around it oscillations of order γ Ergodic theorem: Averaged iterates converge to θ γ = θ at rate O(1/n)

106 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic square log 10 [f(θ) f(θ * )] /2R 2 1/8R 2 4 1/32R 2 1/2R 2 n 1/ log 10 (n)

107 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) log 10 [f(θ) f(θ * )] alpha square C=1 test 1/R 2 1/R 2 n 1/2 SAG log (n) alpha square C=opt test C/R 2 C/R 2 n 1/2 SAG log (n) news square C=1 test 0.2 news square C=opt test log 10 [f(θ) f(θ * )] /R 2 1/R 2 n 1/2 SAG log (n) C/R 2 C/R 2 n 1/2 SAG log 10 (n)

108 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0

109 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0 θ n oscillates around the wrong value θ γ θ θ n θ n θ γ θ 0 θ

110 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0 θ n oscillates around the wrong value θ γ θ moreover, θ θ n = O p ( γ) Ergodic theorem averaged iterates converge to θ γ θ at rate O(1/n) moreover, θ θ γ = O(γ) (Bach, 2013)

111 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic logistic 1 log 10 [f(θ) f(θ * )] 1 2 1/2R 2 3 1/8R 2 4 1/32R 2 5 1/2R 2 n 1/ log 10 (n)

112 Restoring convergence through online Newton steps Known facts 1. Averaged SGD with γ n n 1/2 leads to robust rate O(n 1/2 ) for all convex functions 2. Averaged SGD with γ n constant leads to robust rate O(n 1 ) for all convex quadratic functions 3. Newton s method squares the error at each iteration for smooth functions 4. A single step of Newton s method is equivalent to minimizing the quadratic Taylor expansion Online Newton step Rate: O((n 1/2 ) 2 +n 1 ) = O(n 1 ) Complexity: O(p) per iteration

113 Restoring convergence through online Newton steps Known facts 1. Averaged SGD with γ n n 1/2 leads to robust rate O(n 1/2 ) for all convex functions 2. Averaged SGD with γ n constant leads to robust rate O(n 1 ) for all convex quadratic functions O(n 1 ) 3. Newton s method squares the error at each iteration for smooth functions O((n 1/2 ) 2 ) 4. A single step of Newton s method is equivalent to minimizing the quadratic Taylor expansion Online Newton step Rate: O((n 1/2 ) 2 +n 1 ) = O(n 1 ) Complexity: O(p) per iteration

114 Restoring convergence through online Newton steps The Newton step for f = Ef n (θ) def = E [ l(y n, θ,φ(x n ) ) ] at θ is equivalent to minimizing the quadratic approximation g(θ) = f( θ)+ f ( θ),θ θ θ θ,f ( θ)(θ θ) = f( θ)+ Ef n( θ),θ θ θ θ,ef n( θ)(θ θ) [ = E f( θ)+ f n( θ),θ θ θ θ,f n( θ)(θ θ) ]

115 Restoring convergence through online Newton steps The Newton step for f = Ef n (θ) def = E [ l(y n, θ,φ(x n ) ) ] at θ is equivalent to minimizing the quadratic approximation g(θ) = f( θ)+ f ( θ),θ θ θ θ,f ( θ)(θ θ) = f( θ)+ Ef n( θ),θ θ θ θ,ef n( θ)(θ θ) [ = E f( θ)+ f n( θ),θ θ θ θ,f n( θ)(θ θ) ] Complexity of least-mean-square recursion for g is O(p) θ n = θ n 1 γ [ f n( θ)+f n( θ)(θ n 1 θ) ] f n( θ) = l (y n, θ,φ(x n ) )Φ(x n ) Φ(x n ) has rank one New online Newton step without computing/inverting Hessians

116 Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity

117 Logistic regression - Proof technique Using generalized self-concordance of ϕ : u log(1+e u ): ϕ (u) ϕ (u) NB: difference with regular self-concordance: ϕ (u) 2ϕ (u) 3/2 Using novel high-probability convergence results for regular averaged stochastic gradient descent Requires assumption on the kurtosis in every direction, i.e., E Φ(x n ),η 4 κ [ E Φ(x n ),η 2] 2

118 Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity Update at each iteration using the current averaged iterate Recursion: θ n = θ n 1 γ [ f n( θ n 1 )+f n( θ n 1 )(θ n 1 θ n 1 ) ] No provable convergence rate (yet) but best practical behavior Note (dis)similarity with regular SGD: θ n = θ n 1 γf n(θ n 1 )

119 Online Newton algorithm Current proof (Flammarion et al., 2014) Recursion { θn = θ n 1 γ [ f n( θ n 1 )+f n( θ n 1 )(θ n 1 θ n 1 ) ] θ n = θ n n (θ n θ n 1 ) Instance of two-time-scale stochastic approximation (Borkar, 1997) Given θ, θ n = θ n 1 γ [ f n( θ) + f n( θ)(θ n 1 θ) ] defines a homogeneous Markov chain (fast dynamics) θ n is updated at rate 1/n (slow dynamics) Difficulty: preserving robustness to ill-conditioning

120 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic logistic 1 0 synthetic logistic 2 log 10 [f(θ) f(θ * )] /2R 2 3 1/8R 2 3 every 2 p 4 1/32R 2 every iter. 4 2 step 5 1/2R 2 n 1/2 5 2 step dbl log 10 (n) log 10 (n) log 10 [f(θ) f(θ * )] 1

121 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) alpha logistic C=1 test alpha logistic C=opt test log 10 [f(θ) f(θ * )] /R /R 2 n 1/2 SAG 2 Adagrad Newton log (n) C/R C/R 2 n 1/2 SAG 2 Adagrad Newton log (n) news logistic C=1 test 0.2 news logistic C=opt test log 10 [f(θ) f(θ * )] /R /R 2 n 1/2 SAG 0.8 Adagrad 1 Newton log (n) C/R C/R 2 n 1/2 SAG 0.8 Adagrad 1 Newton log 10 (n)

122 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

123 Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y,θ Φ(x))

124 Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y,θ Φ(x)) Machine learning practice Finite data set (x 1,y 1,...,x n,y n ) Multiple passes Minimizes training cost 1 n n i=1 l(y i,θ Φ(x i )) Need to regularize (e.g., by the l 2 -norm) to avoid overfitting Goal: minimize g(θ) = 1 n n f i (θ) i=1

125 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n Linear (e.g., exponential) convergence rate in O(e αt ) Iteration complexity is linear in n (with line search) n f i(θ t 1 ) i=1

126 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n n f i(θ t 1 ) i=1

127 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n Linear (e.g., exponential) convergence rate in O(e αt ) Iteration complexity is linear in n (with line search) n f i(θ t 1 ) i=1 Stochastic gradient descent: θ t = θ t 1 γ t f i(t) (θ t 1) Sampling with replacement: i(t) random element of {1,...,n} Convergence rate in O(1/t) Iteration complexity is independent of n (step size selection?)

128 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n n f i(θ t 1 ) i=1 Stochastic gradient descent: θ t = θ t 1 γ t f i(t) (θ t 1)

129 Stochastic vs. deterministic methods Goal = best of both worlds: Linear rate with O(1) iteration cost Robustness to step size log(excess cost) stochastic deterministic time

130 Stochastic vs. deterministic methods Goal = best of both worlds: Linear rate with O(1) iteration cost Robustness to step size log(excess cost) hybrid time stochastic deterministic

131 Accelerating gradient methods - Related work Nesterov acceleration Nesterov (1983, 2004) Better linear rate but still O(n) iteration cost Hybrid methods, incremental average gradient, increasing batch size Bertsekas (1997); Blatt et al. (2008); Friedlander and Schmidt (2011) Linear rate, but iterations make full passes through the data.

132 Accelerating gradient methods - Related work Momentum, gradient/iterate averaging, stochastic version of accelerated batch gradient methods Polyak and Juditsky (1992); Tseng (1998); Sunehag et al. (2009); Ghadimi and Lan (2010); Xiao (2010) Can improve constants, but still have sublinear O(1/t) rate Constant step-size stochastic gradient (SG), accelerated SG Kesten (1958); Delyon and Juditsky (1993); Solodov (1998); Nedic and Bertsekas (2000) Linear convergence, but only up to a fixed tolerance. Stochastic methods in the dual Shalev-Shwartz and Zhang (2012) Similar linear rate but limited choice for the f i s

133 Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012) Stochastic average gradient (SAG) iteration Keep in memory the gradients of all functions f i, i = 1,...,n Random selection i(t) {1,...,n} with replacement { Iteration: θ t = θ t 1 γ n t f yi t with yi t n = i (θ t 1 ) if i = i(t) otherwise i=1 y t 1 i

Large-scale machine learning and convex optimization

Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Allerton Conference - September 2015 Slides available at www.di.ens.fr/~fbach/gradsto_allerton.pdf