Large-scale machine learning and convex optimization
1 Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Allerton Conference - September 2015 Slides available at
2 Context Machine learning for big data Large-scale machine learning: large d, large n, large k d : dimension of each observation (input) n : number of observations k : number of tasks (dimension of outputs) Examples: computer vision, bioinformatics, text processing Ideal running-time complexity: O(dn + kn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent
3 Visual object recognition
4 Personal photos
5 Learning for bioinformatics - Proteins Crucial components of cell life Predicting multiple functions and interactions Massive data: up to 1 million for humans! Complex data: amino-acid sequence, link with DNA, three-dimensional molecule
6 Search engines - advertising
9 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
10 Supervised machine learning Data: n observations (x_i, y_i) ∈ X × Y, i = 1,...,n, i.i.d. Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^d (Regularized) empirical risk minimization: find θ̂ solution of min_{θ ∈ R^d} (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) + µΩ(θ) = convex data-fitting term + regularizer
11 Usual losses Regression: y ∈ R, prediction ŷ = θ⊤Φ(x), quadratic loss (1/2)(y − ŷ)² = (1/2)(y − θ⊤Φ(x))²
12 Usual losses Regression: y ∈ R, prediction ŷ = θ⊤Φ(x), quadratic loss (1/2)(y − ŷ)² = (1/2)(y − θ⊤Φ(x))² Classification: y ∈ {−1, 1}, prediction ŷ = sign(θ⊤Φ(x)), loss of the form l(y θ⊤Φ(x)) True 0-1 loss: l(y θ⊤Φ(x)) = 1_{y θ⊤Φ(x) < 0} Usual convex losses: hinge, square, logistic
13 Main motivating examples Support vector machine (hinge loss): l(Y, θ⊤Φ(X)) = max{1 − Y θ⊤Φ(X), 0} Logistic regression: l(Y, θ⊤Φ(X)) = log(1 + exp(−Y θ⊤Φ(X))) Least-squares regression: l(Y, θ⊤Φ(X)) = (1/2)(Y − θ⊤Φ(X))²
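As a concrete illustration of these three losses, here is a small NumPy sketch (not from the slides; the function names and the toy data are my own) evaluating the hinge, logistic, and quadratic losses of a linear predictor θ⊤Φ(x):

```python
import numpy as np

def hinge_loss(y, pred):
    """SVM loss: max{1 - y * theta'Phi(x), 0}, with y in {-1, +1}."""
    return np.maximum(1.0 - y * pred, 0.0)

def logistic_loss(y, pred):
    """Logistic loss: log(1 + exp(-y * theta'Phi(x))), with y in {-1, +1}."""
    return np.logaddexp(0.0, -y * pred)  # numerically stable log(1 + e^{-u})

def squared_loss(y, pred):
    """Least-squares loss: (1/2) (y - theta'Phi(x))^2, with y real-valued."""
    return 0.5 * (y - pred) ** 2

# Example: one observation with features Phi(x) and parameter theta.
theta = np.array([0.5, -1.0, 0.2])
phi_x = np.array([1.0, 0.3, -0.7])
pred = theta @ phi_x
print(hinge_loss(+1, pred), logistic_loss(+1, pred), squared_loss(0.4, pred))
```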
14 Usual regularizers Main goal: avoid overfitting (Squared) Euclidean norm: ‖θ‖₂² = Σ_{j=1}^d θ_j² Numerically well-behaved Representer theorem and kernel methods: θ = Σ_{i=1}^n α_i Φ(x_i) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) Sparsity-inducing norms Main example: l₁-norm ‖θ‖₁ = Σ_{j=1}^d |θ_j| Perform model selection as well as regularization Non-smooth optimization and structured sparsity See, e.g., Bach, Jenatton, Mairal, and Obozinski (2012b,a)
16 Supervised machine learning Data: n observations (x_i, y_i) ∈ X × Y, i = 1,...,n, i.i.d. Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^d (Regularized) empirical risk minimization: find θ̂ solution of min_{θ ∈ R^d} (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) + µΩ(θ) = convex data-fitting term + regularizer Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) = training cost Expected risk: f(θ) = E_{(x,y)} l(y, θ⊤Φ(x)) = testing cost Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂ - may be tackled simultaneously
18 Supervised machine learning Data: n observations (x_i, y_i) ∈ X × Y, i = 1,...,n, i.i.d. Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^d (Regularized) empirical risk minimization: find θ̂ solution of min_{θ ∈ R^d} (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) such that Ω(θ) ≤ D = convex data-fitting term + constraint Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) = training cost Expected risk: f(θ) = E_{(x,y)} l(y, θ⊤Φ(x)) = testing cost Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂ - may be tackled simultaneously
19 General assumptions Data: n observations (x_i, y_i) ∈ X × Y, i = 1,...,n, i.i.d. Bounded features Φ(x) ∈ R^d: ‖Φ(x)‖₂ ≤ R Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) = training cost Expected risk: f(θ) = E_{(x,y)} l(y, θ⊤Φ(x)) = testing cost Loss for a single observation: f_i(θ) = l(y_i, θ⊤Φ(x_i)), so that for all i, f(θ) = E f_i(θ) Properties of f_i, f, f̂: convex on R^d; additional regularity assumptions: Lipschitz-continuity, smoothness and strong convexity
20 Lipschitz continuity Bounded gradients of f (Lipschitz-continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: for all θ ∈ R^d, ‖θ‖₂ ≤ D ⇒ ‖f′(θ)‖₂ ≤ B Machine learning: with f(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)), a G-Lipschitz loss and R-bounded data give B = GR
21 Smoothness and strong convexity A function f : R^d → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous: for all θ₁, θ₂ ∈ R^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂ If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≼ L·Id (Figure: a smooth and a non-smooth function)
22 Smoothness and strong convexity A function f : R^d → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous: for all θ₁, θ₂ ∈ R^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂ If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≼ L·Id Machine learning: with f(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)), Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ l-smooth loss and R-bounded data: L = l·R²
23 Smoothness and strong convexity A function f : R^d → R is µ-strongly convex if and only if for all θ₁, θ₂ ∈ R^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2)‖θ₁ − θ₂‖₂² If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≽ µ·Id (Figure: a convex and a strongly convex function)
24 Smoothness and strong convexity A function f : R^d → R is µ-strongly convex if and only if for all θ₁, θ₂ ∈ R^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2)‖θ₁ − θ₂‖₂² If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≽ µ·Id Machine learning: with f(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)), Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ Data with invertible covariance matrix (low correlation/dimension)
25 Smoothness and strong convexity A function f : R^d → R is µ-strongly convex if and only if for all θ₁, θ₂ ∈ R^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2)‖θ₁ − θ₂‖₂² If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≽ µ·Id Machine learning: with f(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)), Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ Data with invertible covariance matrix (low correlation/dimension) Adding regularization by (µ/2)‖θ‖₂² creates additional bias unless µ is small
26 Summary of smoothness/convexity assumptions Bounded gradients of f (Lipschitz-continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: for all θ ∈ R^d, ‖θ‖₂ ≤ D ⇒ ‖f′(θ)‖₂ ≤ B Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient f′: for all θ₁, θ₂ ∈ R^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂ Strong convexity of f: the function f is strongly convex with respect to the norm ‖·‖₂, with convexity constant µ > 0: for all θ₁, θ₂ ∈ R^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2)‖θ₁ − θ₂‖₂²
27 Analysis of empirical risk minimization Approximation and estimation errors: with Θ = {θ ∈ R^d, Ω(θ) ≤ D},
f(θ̂) − min_{θ ∈ R^d} f(θ) = [f(θ̂) − min_{θ ∈ Θ} f(θ)] + [min_{θ ∈ Θ} f(θ) − min_{θ ∈ R^d} f(θ)] = estimation error + approximation error
NB: may replace min_{θ ∈ R^d} f(θ) by the best (non-linear) predictions
28 Analysis of empirical risk minimization Approximation and estimation errors: with Θ = {θ ∈ R^d, Ω(θ) ≤ D},
f(θ̂) − min_{θ ∈ R^d} f(θ) = [f(θ̂) − min_{θ ∈ Θ} f(θ)] + [min_{θ ∈ Θ} f(θ) − min_{θ ∈ R^d} f(θ)] = estimation error + approximation error
1. Uniform deviation bounds, with θ̂ ∈ argmin_{θ ∈ Θ} f̂(θ) and θ*_Θ ∈ argmin_{θ ∈ Θ} f(θ):
f(θ̂) − min_{θ ∈ Θ} f(θ) = [f(θ̂) − f̂(θ̂)] + [f̂(θ̂) − f̂(θ*_Θ)] + [f̂(θ*_Θ) − f(θ*_Θ)] ≤ sup_{θ ∈ Θ} |f̂(θ) − f(θ)| + 0 + sup_{θ ∈ Θ} |f̂(θ) − f(θ)|
29 Analysis of empirical risk minimization Approximation and estimation errors: with Θ = {θ ∈ R^d, Ω(θ) ≤ D},
f(θ̂) − min_{θ ∈ R^d} f(θ) = [f(θ̂) − min_{θ ∈ Θ} f(θ)] + [min_{θ ∈ Θ} f(θ) − min_{θ ∈ R^d} f(θ)] = estimation error + approximation error
1. Uniform deviation bounds, with θ̂ ∈ argmin_{θ ∈ Θ} f̂(θ): f(θ̂) − min_{θ ∈ Θ} f(θ) ≤ 2 sup_{θ ∈ Θ} |f̂(θ) − f(θ)| - typically slow rate O(1/√n)
2. More refined concentration results with faster rates O(1/n)
30 Motivation from least-squares For least-squares, we have l(y, θ⊤Φ(x)) = (1/2)(y − θ⊤Φ(x))², and
f̂(θ) − f(θ) = (1/2) θ⊤((1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ − E Φ(X)Φ(X)⊤)θ − θ⊤((1/n) Σ_{i=1}^n y_i Φ(x_i) − E Y Φ(X)) + (1/2)((1/n) Σ_{i=1}^n y_i² − E Y²),
so that
sup_{‖θ‖₂ ≤ D} |f(θ) − f̂(θ)| ≤ (D²/2) ‖(1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ − E Φ(X)Φ(X)⊤‖_op + D ‖(1/n) Σ_{i=1}^n y_i Φ(x_i) − E Y Φ(X)‖₂ + (1/2) |(1/n) Σ_{i=1}^n y_i² − E Y²|,
hence sup_{‖θ‖₂ ≤ D} |f(θ) − f̂(θ)| ≤ O(1/√n) with high probability
31 Symmetrization with Rademacher variables Let D′ = {x′₁, y′₁, ..., x′_n, y′_n} be an independent copy of the data D = {x₁, y₁, ..., x_n, y_n}, with corresponding loss functions f′_i(θ):
E[sup_{θ ∈ Θ} f(θ) − f̂(θ)] = E[sup_{θ ∈ Θ} (f(θ) − (1/n) Σ_{i=1}^n f_i(θ))]
= E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n E(f′_i(θ) − f_i(θ) | D)]
≤ E[E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n (f′_i(θ) − f_i(θ)) | D]]
= E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n (f′_i(θ) − f_i(θ))]
= E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i (f′_i(θ) − f_i(θ))] with ε_i uniform in {−1, 1}
≤ 2 E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i f_i(θ)] = 2 × Rademacher complexity
32 Rademacher complexity Define the Rademacher complexity of the class of functions (X, Y) ↦ l(Y, θ⊤Φ(X)) as R_n = E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i f_i(θ)]. Note the two expectations, with respect to the data D and with respect to ε Main property: E[sup_{θ ∈ Θ} f(θ) − f̂(θ)] ≤ 2 R_n
33 From Rademacher complexity to bound in high probability Let Z = sup_{θ ∈ Θ} f(θ) − f̂(θ) By changing a single pair (x_i, y_i), Z may only change by (2/n) sup |l(Y, θ⊤Φ(X))| ≤ (2/n)(sup |l(Y, 0)| + GRD) = (2/n)(l₀ + GRD) = c, with sup |l(Y, 0)| = l₀ McDiarmid's inequality: with probability greater than 1 − δ, Z ≤ EZ + √((n c²/2) log(1/δ)) ≤ 2 R_n + √(2/n) (l₀ + GRD) √(log(1/δ))
34 Bounding the Rademacher average - I We have, with φ_i(u) = l(y_i, u) − l(y_i, 0), which is almost surely G-Lipschitz:
R_n = E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i f_i(θ)]
≤ E[|(1/n) Σ_{i=1}^n ε_i f_i(0)|] + E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i (f_i(θ) − f_i(0))]
≤ l₀/√n + E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i φ_i(θ⊤Φ(x_i))]
Using Ledoux-Talagrand contraction results for Rademacher averages (since φ_i is G-Lipschitz), we get:
R_n ≤ l₀/√n + 2G E[sup_{‖θ‖₂ ≤ D} (1/n) Σ_{i=1}^n ε_i θ⊤Φ(x_i)]
35 Bounding the Rademacher average - II We have:
R_n ≤ l₀/√n + 2G E[sup_{‖θ‖₂ ≤ D} (1/n) Σ_{i=1}^n ε_i θ⊤Φ(x_i)]
= l₀/√n + 2G D E[‖(1/n) Σ_{i=1}^n ε_i Φ(x_i)‖₂]
≤ l₀/√n + 2G D √(E[‖(1/n) Σ_{i=1}^n ε_i Φ(x_i)‖₂²]) by Jensen's inequality
≤ l₀/√n + 2G D R/√n by using ‖Φ(x)‖₂ ≤ R
≤ 2(l₀ + GRD)/√n
Overall, we get, with probability 1 − δ: sup_{θ ∈ Θ} |f̂(θ) − f(θ)| ≤ (1/√n)(l₀ + GRD)(4 + √(2 log(1/δ)))
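The key quantity in this bound is E‖(1/n) Σ ε_i Φ(x_i)‖₂ ≤ R/√n. A small Monte Carlo sketch (my own construction, not from the slides) that checks this inequality numerically for bounded features:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 200, 10, 1.0

# Features with norm exactly R (random points rescaled onto the sphere of radius R).
Phi = rng.standard_normal((n, d))
Phi = R * Phi / np.linalg.norm(Phi, axis=1, keepdims=True)

# Monte Carlo estimate of E || (1/n) sum_i eps_i Phi(x_i) ||_2 over Rademacher signs.
n_samples = 2000
vals = []
for _ in range(n_samples):
    eps = rng.choice([-1.0, 1.0], size=n)
    vals.append(np.linalg.norm(Phi.T @ eps / n))
print("Monte Carlo estimate :", np.mean(vals))
print("upper bound R/sqrt(n):", R / np.sqrt(n))
```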
36 Putting it all together We have, with probability 1 − δ: For the exact minimizer θ̂ ∈ argmin_{θ ∈ Θ} f̂(θ), f(θ̂) − min_{θ ∈ Θ} f(θ) ≤ 2 sup_{θ ∈ Θ} |f̂(θ) − f(θ)| ≤ (2/√n)(l₀ + GRD)(4 + √(2 log(1/δ))) For all θ ∈ Θ, f(θ) − min_{η ∈ Θ} f(η) ≤ 2 sup_{η ∈ Θ} |f̂(η) − f(η)| + [f̂(θ) − f̂(θ̂)] Only need to optimize with precision 2(l₀ + GRD)/√n
37 Slow rate for supervised learning (summary) Assumptions (f is the expected risk, f̂ the empirical risk): Ω(θ) = ‖θ‖₂ (Euclidean norm) Linear predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖₂ ≤ R a.s. G-Lipschitz loss: f and f̂ are GR-Lipschitz on Θ = {‖θ‖₂ ≤ D} No assumptions regarding convexity With probability greater than 1 − δ: sup_{θ ∈ Θ} |f̂(θ) − f(θ)| ≤ ((l₀ + GRD)/√n)[2 + √(2 log(2/δ))] Expected estimation error: E[sup_{θ ∈ Θ} |f̂(θ) − f(θ)|] ≤ 4(l₀ + GRD)/√n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions ⇒ slow rate
38 Motivation from mean estimation Estimator: θ̂ = (1/n) Σ_{i=1}^n z_i = argmin_{θ ∈ R} (1/(2n)) Σ_{i=1}^n (θ − z_i)² = argmin_{θ ∈ R} f̂(θ) From before: f(θ) = (1/2) E(θ − z)² = (1/2)(θ − Ez)² + (1/2) var(z) = f̂(θ) + O(1/√n), and f(θ̂) = (1/2)(θ̂ − Ez)² + (1/2) var(z) = f(Ez) + O(1/√n)
39 Motivation from mean estimation Estimator: θ̂ = (1/n) Σ_{i=1}^n z_i = argmin_{θ ∈ R} (1/(2n)) Σ_{i=1}^n (θ − z_i)² = argmin_{θ ∈ R} f̂(θ) From before: f(θ) = (1/2) E(θ − z)² = (1/2)(θ − Ez)² + (1/2) var(z) = f̂(θ) + O(1/√n), and f(θ̂) = (1/2)(θ̂ − Ez)² + (1/2) var(z) = f(Ez) + O(1/√n) More refined/direct bound: f(θ̂) − f(Ez) = (1/2)(θ̂ − Ez)², so E[f(θ̂) − f(Ez)] = (1/2) E((1/n) Σ_{i=1}^n z_i − Ez)² = (1/(2n)) var(z) Bound only at θ̂ + strong convexity
40 Fast rate for supervised learning Assumptions (f is the expected risk, f̂ the empirical risk): Same as before (bounded features, Lipschitz loss) Regularized risks: f_µ(θ) = f(θ) + (µ/2)‖θ‖₂² and f̂_µ(θ) = f̂(θ) + (µ/2)‖θ‖₂² Convexity For any a > 0, with probability greater than 1 − δ, for all θ ∈ R^d:
f_µ(θ) − min_{η ∈ R^d} f_µ(η) ≤ (1 + a)(f̂_µ(θ) − min_{η ∈ R^d} f̂_µ(η)) + 8(1 + 1/a) G²R² (32 + log(1/δ)) / (µn)
Results from Sridharan, Srebro, and Shalev-Shwartz (2008); see also Boucheron and Massart (2011) and references therein Strongly convex functions ⇒ fast rate Warning: µ should decrease with n to reduce the approximation error
41 Complexity results in convex optimization for ML Assumption: f convex on R d Classical generic algorithms (sub)gradient method/descent Accelerated gradient descent Newton method Key additional properties of f Lipschitz continuity, smoothness or strong convexity Key insight from Bottou and Bousquet (2008) In machine learning, no need to optimize below estimation error Key reference: Nesterov (2004)
43 Subgradient method/descent (Shor et al., 1985) Assumptions: f convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D} Algorithm: θ_t = Π_D(θ_{t−1} − (2D/(B√t)) f′(θ_{t−1})), where Π_D is the orthogonal projection onto {‖θ‖₂ ≤ D} Bound: f((1/t) Σ_{k=0}^{t−1} θ_k) − f(θ*) ≤ 2DB/√t Three-line proof (for constant step-size) Best possible convergence rate after O(d) iterations
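A minimal sketch of this projected subgradient method on a toy non-smooth problem (an empirical hinge-loss risk over a Euclidean ball); the test problem and names are mine, not the slides':

```python
import numpy as np

def project_ball(theta, D):
    """Orthogonal projection onto the Euclidean ball {||theta||_2 <= D}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= D else theta * (D / norm)

def subgradient_method(subgrad, d, D, B, n_iter):
    """Projected subgradient method with step 2D/(B sqrt(t)); returns the averaged iterate."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(1, n_iter + 1):
        step = 2.0 * D / (B * np.sqrt(t))
        theta = project_ball(theta - step * subgrad(theta), D)
        theta_bar += (theta - theta_bar) / t       # running average of the iterates
    return theta_bar

# Toy non-smooth problem: f(theta) = (1/n) sum_i max(1 - y_i x_i'theta, 0).
rng = np.random.default_rng(0)
n, d, D = 100, 5, 1.0
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))

def subgrad(theta):
    margins = y * (X @ theta)
    active = (margins < 1).astype(float)           # examples where the hinge is active
    return -(X.T @ (active * y)) / n               # one subgradient of f at theta

B = np.max(np.linalg.norm(X, axis=1))              # upper bound on the Lipschitz constant of f
theta_bar = subgradient_method(subgrad, d, D, B, n_iter=1000)
print("objective at averaged iterate:", np.mean(np.maximum(1 - y * (X @ theta_bar), 0.0)))
```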
44 Subgradient method/descent - proof - I Iteration: θ_t = Π_D(θ_{t−1} − γ_t f′(θ_{t−1})) with γ_t = 2D/(B√t) Assumption: ‖f′(θ)‖₂ ≤ B and ‖θ*‖₂ ≤ D
‖θ_t − θ*‖₂² ≤ ‖θ_{t−1} − θ* − γ_t f′(θ_{t−1})‖₂² by contractivity of projections
≤ ‖θ_{t−1} − θ*‖₂² + B²γ_t² − 2γ_t (θ_{t−1} − θ*)⊤ f′(θ_{t−1}) because ‖f′(θ_{t−1})‖₂ ≤ B
≤ ‖θ_{t−1} − θ*‖₂² + B²γ_t² − 2γ_t [f(θ_{t−1}) − f(θ*)] (property of subgradients)
leading to f(θ_{t−1}) − f(θ*) ≤ B²γ_t/2 + (1/(2γ_t))[‖θ_{t−1} − θ*‖₂² − ‖θ_t − θ*‖₂²]
45 Subgradient method/descent - proof - II Starting from f(θ_{t−1}) − f(θ*) ≤ B²γ_t/2 + (1/(2γ_t))[‖θ_{t−1} − θ*‖₂² − ‖θ_t − θ*‖₂²], sum over the iterations:
Σ_{u=1}^t [f(θ_{u−1}) − f(θ*)] ≤ Σ_{u=1}^t B²γ_u/2 + Σ_{u=1}^t (1/(2γ_u))[‖θ_{u−1} − θ*‖₂² − ‖θ_u − θ*‖₂²]
= Σ_{u=1}^t B²γ_u/2 + Σ_{u=1}^{t−1} ‖θ_u − θ*‖₂² (1/(2γ_{u+1}) − 1/(2γ_u)) + ‖θ_0 − θ*‖₂²/(2γ_1) − ‖θ_t − θ*‖₂²/(2γ_t)
≤ Σ_{u=1}^t B²γ_u/2 + Σ_{u=1}^{t−1} 4D² (1/(2γ_{u+1}) − 1/(2γ_u)) + 4D²/(2γ_1)
= Σ_{u=1}^t B²γ_u/2 + 4D²/(2γ_t) ≤ 2DB√t with γ_t = 2D/(B√t)
Using convexity: f((1/t) Σ_{k=0}^{t−1} θ_k) − f(θ*) ≤ 2DB/√t
46 Subgradient descent for machine learning Assumptions (f is the expected risk, f̂ the empirical risk): Linear predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖₂ ≤ R a.s. f̂(θ) = (1/n) Σ_{i=1}^n l(y_i, Φ(x_i)⊤θ) G-Lipschitz loss: f and f̂ are GR-Lipschitz on Θ = {‖θ‖₂ ≤ D} Statistics: with probability greater than 1 − δ, sup_{θ ∈ Θ} |f̂(θ) − f(θ)| ≤ (GRD/√n)[2 + √(2 log(2/δ))] Optimization: after t iterations of the subgradient method, f̂(θ̂) − min_{η ∈ Θ} f̂(η) ≤ GRD/√t t = n iterations, with total running-time complexity of O(n²d)
47 Subgradient descent - strong convexity Assumptions: f convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D} f µ-strongly convex Algorithm: θ_t = Π_D(θ_{t−1} − (2/(µ(t+1))) f′(θ_{t−1})) Bound: f((2/(t(t+1))) Σ_{k=1}^t k θ_{k−1}) − f(θ*) ≤ 2B²/(µ(t+1)) Three-line proof Best possible convergence rate after O(d) iterations
48 Subgradient method - strong convexity - proof - I Iteration: θ_t = Π_D(θ_{t−1} − γ_t f′(θ_{t−1})) with γ_t = 2/(µ(t+1)) Assumption: ‖f′(θ)‖₂ ≤ B, ‖θ‖₂ ≤ D, and µ-strong convexity of f
‖θ_t − θ*‖₂² ≤ ‖θ_{t−1} − θ* − γ_t f′(θ_{t−1})‖₂² by contractivity of projections
≤ ‖θ_{t−1} − θ*‖₂² + B²γ_t² − 2γ_t (θ_{t−1} − θ*)⊤ f′(θ_{t−1}) because ‖f′(θ_{t−1})‖₂ ≤ B
≤ ‖θ_{t−1} − θ*‖₂² + B²γ_t² − 2γ_t [f(θ_{t−1}) − f(θ*) + (µ/2)‖θ_{t−1} − θ*‖₂²] (property of subgradients and strong convexity)
leading to f(θ_{t−1}) − f(θ*) ≤ B²γ_t/2 + [1/(2γ_t) − µ/2]‖θ_{t−1} − θ*‖₂² − (1/(2γ_t))‖θ_t − θ*‖₂²
= B²/(µ(t+1)) + (µ/4)[(t−1)‖θ_{t−1} − θ*‖₂² − (t+1)‖θ_t − θ*‖₂²]
49 Subgradient method - strong convexity - proof - II From f(θ_{t−1}) − f(θ*) ≤ B²/(µ(t+1)) + (µ/4)[(t−1)‖θ_{t−1} − θ*‖₂² − (t+1)‖θ_t − θ*‖₂²], multiply by t and sum:
Σ_{u=1}^t u [f(θ_{u−1}) − f(θ*)] ≤ Σ_{u=1}^t B²u/(µ(u+1)) + (µ/4) Σ_{u=1}^t [u(u−1)‖θ_{u−1} − θ*‖₂² − u(u+1)‖θ_u − θ*‖₂²]
≤ B²t/µ + (µ/4)[0 − t(t+1)‖θ_t − θ*‖₂²] ≤ B²t/µ
Using convexity: f((2/(t(t+1))) Σ_{u=1}^t u θ_{u−1}) − f(θ*) ≤ 2B²/(µ(t+1))
NB: with step-size γ_n = 1/(nµ), extra logarithmic factor
50 (smooth) gradient descent Assumptions: f convex with L-Lipschitz-continuous gradient Minimum attained at θ* Algorithm: θ_t = θ_{t−1} − (1/L) f′(θ_{t−1}) Bound: f(θ_t) − f(θ*) ≤ 2L‖θ_0 − θ*‖²/(t+4) Three-line proof Not best possible convergence rate after O(d) iterations
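A sketch of this recursion on a least-squares objective, where the smoothness constant L is the largest eigenvalue of the Hessian (the example and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def f(theta):
    """f(theta) = 1/(2n) ||y - X theta||^2."""
    return 0.5 * np.mean((y - X @ theta) ** 2)

def grad(theta):
    return X.T @ (X @ theta - y) / n

L = np.linalg.eigvalsh(X.T @ X / n).max()    # smoothness constant = largest Hessian eigenvalue

theta = np.zeros(d)
for t in range(500):
    theta = theta - grad(theta) / L          # theta_t = theta_{t-1} - (1/L) f'(theta_{t-1})

theta_opt = np.linalg.lstsq(X, y, rcond=None)[0]
print("excess cost:", f(theta) - f(theta_opt))
```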
51 (smooth) gradient descent - strong convexity Assumptions: f convex with L-Lipschitz-continuous gradient f µ-strongly convex Algorithm: θ_t = θ_{t−1} − (1/L) f′(θ_{t−1}) Bound: f(θ_t) − f(θ*) ≤ (1 − µ/L)^t [f(θ_0) − f(θ*)] Three-line proof Adaptivity of gradient descent to problem difficulty Line search
52 Accelerated gradient methods (Nesterov, 1983) Assumptions: f convex with L-Lipschitz-continuous gradient, minimum attained at θ* Algorithm: θ_t = η_{t−1} − (1/L) f′(η_{t−1}); η_t = θ_t + ((t−1)/(t+2))(θ_t − θ_{t−1}) Bound: f(θ_t) − f(θ*) ≤ 2L‖θ_0 − θ*‖²/(t+1)² Ten-line proof Not improvable Extension to strongly convex functions
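A sketch of this accelerated recursion on the same kind of least-squares toy problem (the example and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)
grad = lambda th: X.T @ (X @ th - y) / n
L = np.linalg.eigvalsh(X.T @ X / n).max()

theta = np.zeros(d)
eta = theta.copy()
for t in range(1, 501):
    theta_new = eta - grad(eta) / L                                 # theta_t = eta_{t-1} - (1/L) f'(eta_{t-1})
    eta = theta_new + (t - 1) / (t + 2) * (theta_new - theta)       # eta_t = theta_t + (t-1)/(t+2)(theta_t - theta_{t-1})
    theta = theta_new
print("final objective:", 0.5 * np.mean((y - X @ theta) ** 2))
```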
53 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2012b) Gradient descent as a proximal method (differentiable functions): θ_{t+1} = argmin_{θ ∈ R^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + (L/2)‖θ − θ_t‖₂², i.e., θ_{t+1} = θ_t − (1/L)∇f(θ_t)
54 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2012b) Gradient descent as a proximal method (differentiable functions): θ_{t+1} = argmin_{θ ∈ R^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + (L/2)‖θ − θ_t‖₂², i.e., θ_{t+1} = θ_t − (1/L)∇f(θ_t) Problems of the form min_{θ ∈ R^d} f(θ) + µΩ(θ): θ_{t+1} = argmin_{θ ∈ R^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + µΩ(θ) + (L/2)‖θ − θ_t‖₂² Ω(θ) = ‖θ‖₁ ⇒ thresholded gradient descent Similar convergence rates to smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)
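For Ω(θ) = ‖θ‖₁ the proximal step has a closed form (componentwise soft-thresholding), which gives the thresholded gradient descent mentioned above. A minimal ISTA-style sketch on a sparse least-squares problem (setup and names are mine):

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
n, d, mu = 100, 20, 0.1
X = rng.standard_normal((n, d))
theta_true = np.zeros(d)
theta_true[:3] = [1.0, -2.0, 0.5]                 # sparse ground truth
y = X @ theta_true + 0.01 * rng.standard_normal(n)

grad = lambda th: X.T @ (X @ th - y) / n          # gradient of f(theta) = 1/(2n)||y - X theta||^2
L = np.linalg.eigvalsh(X.T @ X / n).max()

theta = np.zeros(d)
for _ in range(500):
    # Minimize the quadratic model plus mu ||theta||_1: gradient step followed by the prox.
    theta = soft_threshold(theta - grad(theta) / L, mu / L)
print("non-zero coordinates:", np.flatnonzero(theta))
```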
55 Summary: minimizing convex functions Assumption: f convex Gradient descent: θ_t = θ_{t−1} − γ_t f′(θ_{t−1}) O(1/√t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e^{−ρt}) convergence rate for smooth and strongly convex functions Newton method: θ_t = θ_{t−1} − f″(θ_{t−1})^{−1} f′(θ_{t−1}) O(e^{−ρ2^t}) convergence rate
56 Summary: minimizing convex functions Assumption: f convex Gradient descent: θ_t = θ_{t−1} − γ_t f′(θ_{t−1}) O(1/√t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e^{−ρt}) convergence rate for smooth and strongly convex functions Newton method: θ_t = θ_{t−1} − f″(θ_{t−1})^{−1} f′(θ_{t−1}) O(e^{−ρ2^t}) convergence rate Key insights from Bottou and Bousquet (2008): 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages ⇒ Stochastic approximation
57 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
58 Stochastic approximation Goal: minimizing a function f defined on R^d, given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ R^d Stochastic approximation: (much) broader applicability beyond convex optimization θ_n = θ_{n−1} − γ_n h_n(θ_{n−1}) with E[h_n(θ_{n−1}) | θ_{n−1}] = h(θ_{n−1}) Beyond convex problems, i.i.d. assumption, finite dimension, etc. Typically asymptotic results See, e.g., Kushner and Yin (2003); Benveniste et al. (2012)
59 Stochastic approximation Goal: minimizing a function f defined on R^d, given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ R^d Machine learning - statistics: loss for a single pair of observations f_n(θ) = l(y_n, θ⊤Φ(x_n)); f(θ) = E f_n(θ) = E l(y_n, θ⊤Φ(x_n)) = generalization error Expected gradient: f′(θ) = E f′_n(θ) = E{l′(y_n, θ⊤Φ(x_n)) Φ(x_n)} Non-asymptotic results Number of iterations = number of observations Single line of code!
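The "single line of code" is the update θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) itself. A sketch for logistic regression on a simulated stream of observations (the data model, step-size choice, and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
theta_star = rng.standard_normal(d)

def sample_observation():
    """One i.i.d. pair (Phi(x_n), y_n) from a synthetic logistic model."""
    phi = rng.standard_normal(d)
    y = 1 if rng.random() < 1 / (1 + np.exp(-phi @ theta_star)) else -1
    return phi, y

theta = np.zeros(d)
for n in range(1, 10001):
    phi, y = sample_observation()
    gamma = 1.0 / np.sqrt(n)                              # decaying step-size
    grad_n = -y * phi / (1 + np.exp(y * (phi @ theta)))   # f_n'(theta) for the logistic loss
    theta -= gamma * grad_n                               # the single-line SGD update
print("distance to the population optimum:", np.linalg.norm(theta - theta_star))
```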
60 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations
61 Stochastic approximation Relationship to online learning Minimize f(θ) = E_z l(θ, z) = generalization error of θ, using the gradients of single i.i.d. observations Batch learning: Finite set of observations z₁, ..., z_n Empirical risk: f̂(θ) = (1/n) Σ_{k=1}^n l(θ, z_k) Estimator θ̂ = minimizer of f̂(θ) over a certain class Θ Generalization bound using uniform concentration results
62 Stochastic approximation Relationship to online learning Minimize f(θ) = E_z l(θ, z) = generalization error of θ, using the gradients of single i.i.d. observations Batch learning: Finite set of observations z₁, ..., z_n Empirical risk: f̂(θ) = (1/n) Σ_{k=1}^n l(θ, z_k) Estimator θ̂ = minimizer of f̂(θ) over a certain class Θ Generalization bound using uniform concentration results Online learning: Update θ̂_n after each new (potentially adversarial) observation z_n Cumulative loss: (1/n) Σ_{k=1}^n l(θ̂_{k−1}, z_k) Online to batch through averaging (Cesa-Bianchi et al., 2004)
63 Convex stochastic approximation Key properties of f and/or f_n Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous Strong convexity: f µ-strongly convex
64 Convex stochastic approximation Key properties of f and/or f_n Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro) θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) Polyak-Ruppert averaging: θ̄_n = (1/n) Σ_{k=0}^{n−1} θ_k Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}
65 Convex stochastic approximation Key properties of f and/or f_n Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro) θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) Polyak-Ruppert averaging: θ̄_n = (1/n) Σ_{k=0}^{n−1} θ_k Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α} Desirable practical behavior: Applicable (at least) to classical supervised learning problems Robustness to (potentially unknown) constants (L, B, µ) Adaptivity to the difficulty of the problem (e.g., strong convexity)
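A sketch of this recursion with the classical step-size γ_n = C n^{−α} and Polyak-Ruppert averaging, on a toy strongly convex problem (the problem and names are mine):

```python
import numpy as np

def averaged_sgd(stochastic_grad, d, C=1.0, alpha=0.5, n_iter=10000):
    """Run theta_n = theta_{n-1} - C n^{-alpha} f_n'(theta_{n-1}) and average the iterates."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for n in range(1, n_iter + 1):
        gamma = C * n ** (-alpha)
        theta = theta - gamma * stochastic_grad(theta)
        theta_bar += (theta - theta_bar) / n        # running Polyak-Ruppert average
    return theta, theta_bar

# Toy problem: f(theta) = (1/2)||theta - theta*||^2, gradients observed with additive noise.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])
noisy_grad = lambda th: (th - theta_star) + rng.standard_normal(3)

last, avg = averaged_sgd(noisy_grad, d=3, C=1.0, alpha=0.5)
print("last iterate error:", np.linalg.norm(last - theta_star))
print("averaged error    :", np.linalg.norm(avg - theta_star))
```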
66 Stochastic subgradient descent/method Assumptions: f_n convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D} (f_n) i.i.d. functions such that E f_n = f θ* global optimum of f on {‖θ‖₂ ≤ D} Algorithm: θ_n = Π_D(θ_{n−1} − (2D/(B√n)) f′_n(θ_{n−1})) Bound: E f((1/n) Σ_{k=0}^{n−1} θ_k) − f(θ*) ≤ 2DB/√n Same three-line proof as in the deterministic case Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Running-time complexity: O(dn) after n iterations
67 Stochastic subgradient method - proof - I Iteration: θ_n = Π_D(θ_{n−1} − γ_n f′_n(θ_{n−1})) with γ_n = 2D/(B√n) F_n: information up to time n ‖f′_n(θ)‖₂ ≤ B and ‖θ‖₂ ≤ D, unbiased gradients/functions E(f_n | F_{n−1}) = f
‖θ_n − θ*‖₂² ≤ ‖θ_{n−1} − θ* − γ_n f′_n(θ_{n−1})‖₂² by contractivity of projections
≤ ‖θ_{n−1} − θ*‖₂² + B²γ_n² − 2γ_n (θ_{n−1} − θ*)⊤ f′_n(θ_{n−1}) because ‖f′_n(θ_{n−1})‖₂ ≤ B
E[‖θ_n − θ*‖₂² | F_{n−1}] ≤ ‖θ_{n−1} − θ*‖₂² + B²γ_n² − 2γ_n (θ_{n−1} − θ*)⊤ f′(θ_{n−1})
≤ ‖θ_{n−1} − θ*‖₂² + B²γ_n² − 2γ_n [f(θ_{n−1}) − f(θ*)] (subgradient property)
E‖θ_n − θ*‖₂² ≤ E‖θ_{n−1} − θ*‖₂² + B²γ_n² − 2γ_n [E f(θ_{n−1}) − f(θ*)]
leading to E f(θ_{n−1}) − f(θ*) ≤ B²γ_n/2 + (1/(2γ_n))[E‖θ_{n−1} − θ*‖₂² − E‖θ_n − θ*‖₂²]
68 Stochastic subgradient method - proof - II Starting from E f(θ_{n−1}) − f(θ*) ≤ B²γ_n/2 + (1/(2γ_n))[E‖θ_{n−1} − θ*‖₂² − E‖θ_n − θ*‖₂²], sum over the iterations:
Σ_{u=1}^n [E f(θ_{u−1}) − f(θ*)] ≤ Σ_{u=1}^n B²γ_u/2 + Σ_{u=1}^n (1/(2γ_u))[E‖θ_{u−1} − θ*‖₂² − E‖θ_u − θ*‖₂²]
≤ Σ_{u=1}^n B²γ_u/2 + 4D²/(2γ_n) ≤ 2DB√n with γ_n = 2D/(B√n)
Using convexity: E f((1/n) Σ_{k=0}^{n−1} θ_k) − f(θ*) ≤ 2DB/√n
69 Stochastic subgradient descent - strong convexity - I Assumptions: f_n convex and B-Lipschitz-continuous (f_n) i.i.d. functions such that E f_n = f f µ-strongly convex on {‖θ‖₂ ≤ D} θ* global optimum of f over {‖θ‖₂ ≤ D} Algorithm: θ_n = Π_D(θ_{n−1} − (2/(µ(n+1))) f′_n(θ_{n−1})) Bound: E f((2/(n(n+1))) Σ_{k=1}^n k θ_{k−1}) − f(θ*) ≤ 2B²/(µ(n+1)) Same proof as in the deterministic case (Lacoste-Julien et al., 2012) Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
70 Stochastic subgradient descent - strong convexity - II Assumptions: f_n convex and B-Lipschitz-continuous (f_n) i.i.d. functions such that E f_n = f θ* global optimum of g = f + (µ/2)‖·‖₂² No compactness assumption - no projections Algorithm: θ_n = θ_{n−1} − (2/(µ(n+1))) g′_n(θ_{n−1}) = θ_{n−1} − (2/(µ(n+1))) [f′_n(θ_{n−1}) + µθ_{n−1}] Bound: E g((2/(n(n+1))) Σ_{k=1}^n k θ_{k−1}) − g(θ*) ≤ 2B²/(µ(n+1))
71 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
72 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)
73 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Many contributions in optimization and online learning: Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)
74 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems A single algorithm with a global adaptive convergence rate for smooth problems?
75 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems Non-asymptotic analysis for smooth problems, with convergence rate O(min{1/(µn), 1/√n})?
76 Smoothness/convexity assumptions Iteration: θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) Polyak-Ruppert averaging: θ̄_n = (1/n) Σ_{k=0}^{n−1} θ_k Smoothness of f_n: for each n ≥ 1, the function f_n is a.s. convex, differentiable with L-Lipschitz-continuous gradient f′_n (smooth loss and bounded data) Strong convexity of f: the function f is strongly convex with respect to the norm ‖·‖₂, with convexity constant µ > 0 (invertible population covariance matrix or regularization by (µ/2)‖θ‖²)
77 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ_n = C n^{−α} Strongly convex smooth objective functions: Old: O(n^{−1}) rate achieved without averaging for α = 1 New: O(n^{−1}) rate achieved with averaging for α ∈ [1/2, 1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C
78 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ_n = C n^{−α} Strongly convex smooth objective functions: Old: O(n^{−1}) rate achieved without averaging for α = 1 New: O(n^{−1}) rate achieved with averaging for α ∈ [1/2, 1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C Convergence rates for E‖θ_n − θ*‖² and E‖θ̄_n − θ*‖²: no averaging: O(σ²γ_n/µ) + O(e^{−µnγ_n})‖θ_0 − θ*‖²; averaging: tr H(θ*)^{−1}/n + µ^{−1} O(n^{−2α} + n^{−2+α}) + O(‖θ_0 − θ*‖²/(µ²n²))
79 Classical proof sketch (no averaging)
‖θ_n − θ*‖₂² = ‖θ_{n−1} − γ_n f′_n(θ_{n−1}) − θ*‖₂²
= ‖θ_{n−1} − θ*‖₂² − 2γ_n (θ_{n−1} − θ*)⊤ f′_n(θ_{n−1}) + γ_n² ‖f′_n(θ_{n−1})‖₂²
≤ ‖θ_{n−1} − θ*‖₂² − 2γ_n (θ_{n−1} − θ*)⊤ f′_n(θ_{n−1}) + 2γ_n² ‖f′_n(θ*)‖₂² + 2γ_n² ‖f′_n(θ_{n−1}) − f′_n(θ*)‖₂²
≤ ‖θ_{n−1} − θ*‖₂² − 2γ_n (θ_{n−1} − θ*)⊤ f′_n(θ_{n−1}) + 2γ_n² ‖f′_n(θ*)‖₂² + 2γ_n² L [f′_n(θ_{n−1}) − f′_n(θ*)]⊤(θ_{n−1} − θ*)
Taking conditional expectations, E[‖θ_n − θ*‖₂² | F_{n−1}]
≤ ‖θ_{n−1} − θ*‖₂² − 2γ_n (θ_{n−1} − θ*)⊤ f′(θ_{n−1}) + 2γ_n² E‖f′_n(θ*)‖₂² + 2γ_n² L [f′(θ_{n−1}) − 0]⊤(θ_{n−1} − θ*)
= ‖θ_{n−1} − θ*‖₂² − 2γ_n(1 − γ_n L)(θ_{n−1} − θ*)⊤ f′(θ_{n−1}) + 2γ_n²σ²
≤ ‖θ_{n−1} − θ*‖₂² − 2γ_n(1 − γ_n L)(µ/2)‖θ_{n−1} − θ*‖₂² + 2γ_n²σ²
= [1 − µγ_n(1 − γ_n L)] ‖θ_{n−1} − θ*‖₂² + 2γ_n²σ²,
hence E[‖θ_n − θ*‖₂²] ≤ [1 − µγ_n(1 − γ_n L)] E[‖θ_{n−1} − θ*‖₂²] + 2γ_n²σ²
80 Proof sketch (averaging) From Polyak and Juditsky (1992): θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}), so f′_n(θ_{n−1}) = (1/γ_n)(θ_{n−1} − θ_n)
f′_n(θ*) + f″_n(θ*)(θ_{n−1} − θ*) = (1/γ_n)(θ_{n−1} − θ_n) + O(‖θ_{n−1} − θ*‖²)
f′_n(θ*) + f″(θ*)(θ_{n−1} − θ*) = (1/γ_n)(θ_{n−1} − θ_n) + O(‖θ_{n−1} − θ*‖²) + O(‖θ_{n−1} − θ*‖) ε_n
hence θ_{n−1} − θ* = −f″(θ*)^{−1} f′_n(θ*) + (1/γ_n) f″(θ*)^{−1}(θ_{n−1} − θ_n) + O(‖θ_{n−1} − θ*‖²) + O(‖θ_{n−1} − θ*‖) ε_n
Averaging to cancel the term (1/γ_n) f″(θ*)^{−1}(θ_{n−1} − θ_n)
81 Robustness to wrong constants for γ_n = C n^{−α} f(θ) = (1/2)θ² with i.i.d. Gaussian noise (d = 1) Left: α = 1/2; Right: α = 1 (Figure: log[f(θ_n) − f*] vs. log(n) for SGD and averaged SGD with C ∈ {1/5, 1, 5}, for α = 1/2 and α = 1.) See also
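The experiment behind this figure is easy to reproduce. Below is a sketch of one possible version (f(θ) = θ²/2 in dimension 1, gradient observed with additive standard Gaussian noise; the exact constants are my guesses, not the authors' script):

```python
import numpy as np

def run(C, alpha, n_iter=10000, seed=0):
    """SGD on f(theta) = theta^2/2 with noisy gradient theta + N(0,1); returns f at last and averaged iterate."""
    rng = np.random.default_rng(seed)
    theta, theta_bar = 1.0, 0.0
    for n in range(1, n_iter + 1):
        gamma = C * n ** (-alpha)
        theta -= gamma * (theta + rng.standard_normal())   # unbiased stochastic gradient
        theta_bar += (theta - theta_bar) / n               # Polyak-Ruppert average
    return 0.5 * theta ** 2, 0.5 * theta_bar ** 2

for alpha in (0.5, 1.0):
    for C in (0.2, 1.0, 5.0):
        f_sgd, f_ave = run(C, alpha)
        print(f"alpha={alpha}, C={C}: f(theta_n)={f_sgd:.2e}, f(averaged)={f_ave:.2e}")
```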
82 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ_n = C n^{−α} Strongly convex smooth objective functions: Old: O(n^{−1}) rate achieved without averaging for α = 1 New: O(n^{−1}) rate achieved with averaging for α ∈ [1/2, 1] Non-asymptotic analysis with explicit constants
83 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ_n = C n^{−α} Strongly convex smooth objective functions: Old: O(n^{−1}) rate achieved without averaging for α = 1 New: O(n^{−1}) rate achieved with averaging for α ∈ [1/2, 1] Non-asymptotic analysis with explicit constants Non-strongly convex smooth objective functions: Old: O(n^{−1/2}) rate achieved with averaging for α = 1/2 New: O(max{n^{1/2−3α/2}, n^{−α/2}, n^{α−1}}) rate achieved without averaging for α ∈ [1/3, 1] (worse than with averaging) Take-home message: use α = 1/2 with averaging to be adaptive to strong convexity
84 Beyond stochastic gradient method Adding a proximal step Goal: min_{θ ∈ R^d} f(θ) + Ω(θ) = E f_n(θ) + Ω(θ) Replace the recursion θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) by θ_n = argmin_{θ ∈ R^d} ‖θ − θ_{n−1} + γ_n f′_n(θ_{n−1})‖₂² + C Ω(θ) Xiao (2010); Hu et al. (2009) May be accelerated (Ghadimi and Lan, 2013) Related frameworks: Regularized dual averaging (Nesterov, 2009; Xiao, 2010) Mirror descent (Nemirovski et al., 2009; Lan et al., 2012)
85 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
86 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems A single adaptive algorithm for smooth problems with convergence rate O(min{1/(µn), 1/√n}) in all situations?
87 Adaptive algorithm for logistic regression Logistic regression: (Φ(x_n), y_n) ∈ R^d × {−1, 1} Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n))) Generalization error: f(θ) = E f_n(θ)
88 Adaptive algorithm for logistic regression Logistic regression: (Φ(x_n), y_n) ∈ R^d × {−1, 1} Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n))) Generalization error: f(θ) = E f_n(θ) Cannot be strongly convex ⇒ local strong convexity, unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^{−M}) µ = lowest eigenvalue of the Hessian at the optimum, f″(θ*) (Figure: the logistic loss)
89 Adaptive algorithm for logistic regression Logistic regression: (Φ(x_n), y_n) ∈ R^d × {−1, 1} Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n))) Generalization error: f(θ) = E f_n(θ) Cannot be strongly convex ⇒ local strong convexity, unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^{−M}) µ = lowest eigenvalue of the Hessian at the optimum, f″(θ*) n steps of averaged SGD with constant step-size 1/(2R²√n), with R = radius of the data (Bach, 2013): E f(θ̄_n) − f(θ*) ≤ min{1/√n, R²/(nµ)} (15 + 5R‖θ_0 − θ*‖)⁴ Proof based on self-concordance (Nesterov and Nemirovski, 1994)
90 Self-concordance Usual definition for convex φ : R → R: |φ‴(t)| ≤ 2 φ″(t)^{3/2} Affine invariant Extendable to all convex functions on R^d by looking at rays Used for the sharp proof of quadratic convergence of Newton method (Nesterov and Nemirovski, 1994) Generalized notion: |φ‴(t)| ≤ φ″(t) Applicable to logistic regression (with extensions)
91 Self-concordance Usual definition for convex φ : R → R: |φ‴(t)| ≤ 2 φ″(t)^{3/2} Affine invariant Extendable to all convex functions on R^d by looking at rays Used for the sharp proof of quadratic convergence of Newton method (Nesterov and Nemirovski, 1994) Generalized notion: |φ‴(t)| ≤ φ″(t) Applicable to logistic regression (with extensions) Important properties: Allows global Taylor expansions Relates expansions of derivatives of different orders
92 Adaptive algorithm for logistic regression Proof sketch Step 1: use the existing result f(θ̄_n) ≤ f(θ*) + (R²/√n)‖θ_0 − θ*‖₂² = f(θ*) + O(1/√n) Step 2: f′_n(θ_{n−1}) = (1/γ)(θ_{n−1} − θ_n), hence (1/n) Σ_{k=1}^n f′_k(θ_{k−1}) = (1/(nγ))(θ_0 − θ_n) Step 3: ‖f′((1/n) Σ_{k=1}^n θ_{k−1}) − (1/n) Σ_{k=1}^n f′(θ_{k−1})‖₂ = O(f(θ̄_n) − f(θ*)) = O(1/√n) using self-concordance Step 4a: if f is µ-strongly convex, f(θ̄_n) − f(θ*) ≤ (1/(2µ))‖f′(θ̄_n)‖₂² Step 4b: if f is self-concordant, this holds locally with µ = λ_min(f″(θ*))
94 Adaptive algorithm for logistic regression Logistic regression: (Φ(x_n), y_n) ∈ R^d × {−1, 1} Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n))) Generalization error: f(θ) = E f_n(θ) Cannot be strongly convex ⇒ local strong convexity, unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^{−M}) µ = lowest eigenvalue of the Hessian at the optimum, f″(θ*) n steps of averaged SGD with constant step-size 1/(2R²√n), with R = radius of the data (Bach, 2013): E f(θ̄_n) − f(θ*) ≤ min{1/√n, R²/(nµ)} (15 + 5R‖θ_0 − θ*‖)⁴ A single adaptive algorithm for smooth problems with convergence rate O(1/n) in all situations?
95 Least-mean-square algorithm Least-squares: f(θ) = (1/2) E[(y_n − ⟨Φ(x_n), θ⟩)²] with θ ∈ R^d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) Usually studied without averaging and with decreasing step-sizes, under the strong convexity assumption E[Φ(x_n) ⊗ Φ(x_n)] = H ≽ µ·Id
96 Least-mean-square algorithm Least-squares: f(θ) = (1/2) E[(y_n − ⟨Φ(x_n), θ⟩)²] with θ ∈ R^d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) Usually studied without averaging and with decreasing step-sizes, under the strong convexity assumption E[Φ(x_n) ⊗ Φ(x_n)] = H ≽ µ·Id New analysis for averaging and constant step-size γ = 1/(4R²) Assume ‖Φ(x_n)‖ ≤ R and |y_n − ⟨Φ(x_n), θ*⟩| ≤ σ almost surely No assumption regarding the lowest eigenvalues of H Main result: E f(θ̄_{n−1}) − f(θ*) ≤ 4σ²d/n + 4R²‖θ_0 − θ*‖²/n Matches the statistical lower bound (Tsybakov, 2003) Non-asymptotic robust version of Györfi and Walk (1996)
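A sketch of this averaged, constant-step-size least-mean-square recursion with γ = 1/(4R²) (the synthetic data and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 100000
theta_star = rng.standard_normal(d)

# Stream of observations with bounded features and bounded optimal residuals.
Phi = rng.uniform(-1.0, 1.0, size=(n, d))
R = np.max(np.linalg.norm(Phi, axis=1))
y = Phi @ theta_star + 0.1 * rng.uniform(-1.0, 1.0, size=n)

gamma = 1.0 / (4.0 * R ** 2)        # constant step-size from the analysis above
theta = np.zeros(d)
theta_bar = np.zeros(d)
for k in range(n):
    # LMS update: theta_n = theta_{n-1} - gamma (<Phi(x_n), theta_{n-1}> - y_n) Phi(x_n)
    theta -= gamma * (Phi[k] @ theta - y[k]) * Phi[k]
    theta_bar += (theta - theta_bar) / (k + 1)   # Polyak-Ruppert averaging

print("error of last iterate    :", np.linalg.norm(theta - theta_star))
print("error of averaged iterate:", np.linalg.norm(theta_bar - theta_star))
```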
97 Least-squares - Proof technique LMS recursion: θ_n − θ* = [I − γ Φ(x_n) ⊗ Φ(x_n)](θ_{n−1} − θ*) + γ ε_n Φ(x_n) Simplified LMS recursion: with H = E[Φ(x_n) ⊗ Φ(x_n)], θ_n − θ* = [I − γH](θ_{n−1} − θ*) + γ ε_n Φ(x_n) Direct proof technique of Polyak and Juditsky (1992), e.g., θ_n − θ* = [I − γH]^n (θ_0 − θ*) + γ Σ_{k=1}^n [I − γH]^{n−k} ε_k Φ(x_k) Infinite expansion of Aguech, Moulines, and Priouret (2000) in powers of γ
98 Markov chain interpretation of constant step sizes LMS recursion for f_n(θ) = (1/2)(y_n − ⟨Φ(x_n), θ⟩)²: θ_n = θ_{n−1} − γ(⟨Φ(x_n), θ_{n−1}⟩ − y_n) Φ(x_n) The sequence (θ_n)_n is a homogeneous Markov chain: convergence to a stationary distribution π_γ with expectation θ̄_γ = ∫ θ π_γ(dθ) For least-squares, θ̄_γ = θ* (Figure: iterates θ_n oscillating around θ*, starting from θ_0.)
101 Markov chain interpretation of constant step sizes LMS recursion for f_n(θ) = (1/2)(y_n − ⟨Φ(x_n), θ⟩)²: θ_n = θ_{n−1} − γ(⟨Φ(x_n), θ_{n−1}⟩ − y_n) Φ(x_n) The sequence (θ_n)_n is a homogeneous Markov chain: convergence to a stationary distribution π_γ with expectation θ̄_γ = ∫ θ π_γ(dθ) For least-squares, θ̄_γ = θ*: θ_n does not converge to θ* but oscillates around it, with oscillations of order √γ Ergodic theorem: averaged iterates converge to θ̄_γ = θ* at rate O(1/n)
102 Simulations - synthetic examples Gaussian distributions - p = 20 (Figure: log₁₀[f(θ) − f(θ*)] vs. log₁₀(n) for the square loss on synthetic data, step-sizes 1/(2R²), 1/(8R²), 1/(32R²) and 1/(2R²√n).)
103 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) (Figure: log₁₀[f(θ) − f(θ*)] on the test set vs. log₁₀(n) for the square loss on the alpha and news data sets, comparing step-sizes 1/R² and 1/(R²√n) (resp. C/R² and C/(R²√n) for C = opt) with SAG.)
104 Beyond least-squares - Markov chain interpretation Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0 When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0
105 Beyond least-squares - Markov chain interpretation Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0 When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0 θ_n oscillates around the wrong value θ̄_γ ≠ θ* (Figure: iterates oscillating around θ̄_γ rather than θ*.)
106 Beyond least-squares - Markov chain interpretation Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0 When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0 θ_n oscillates around the wrong value θ̄_γ ≠ θ*; moreover, θ* − θ_n = O_p(√γ) Linear convergence up to the noise level for strongly-convex problems (Nedic and Bertsekas, 2000) Ergodic theorem: averaged iterates converge to θ̄_γ ≠ θ* at rate O(1/n); moreover, θ* − θ̄_γ = O(γ) (Bach, 2013)
107 Simulations - synthetic examples Gaussian distributions - p = 20 (Figure: log₁₀[f(θ) − f(θ*)] vs. log₁₀(n) for the logistic loss on synthetic data, step-sizes 1/(2R²), 1/(8R²), 1/(32R²) and 1/(2R²√n).)
108 Restoring convergence through online Newton steps Known facts: 1. Averaged SGD with γ_n ∝ n^{−1/2} leads to the robust rate O(n^{−1/2}) for all convex functions 2. Averaged SGD with γ_n constant leads to the robust rate O(n^{−1}) for all convex quadratic functions 3. Newton's method squares the error at each iteration for smooth functions 4. A single step of Newton's method is equivalent to minimizing the quadratic Taylor expansion Online Newton step: Rate: O((n^{−1/2})² + n^{−1}) = O(n^{−1}) Complexity: O(p) per iteration
109 Restoring convergence through online Newton steps Known facts: 1. Averaged SGD with γ_n ∝ n^{−1/2} leads to the robust rate O(n^{−1/2}) for all convex functions 2. Averaged SGD with γ_n constant leads to the robust rate O(n^{−1}) for all convex quadratic functions ⇒ O(n^{−1}) 3. Newton's method squares the error at each iteration for smooth functions ⇒ O((n^{−1/2})²) 4. A single step of Newton's method is equivalent to minimizing the quadratic Taylor expansion Online Newton step: Rate: O((n^{−1/2})² + n^{−1}) = O(n^{−1}) Complexity: O(p) per iteration
110 Restoring convergence through online Newton steps The Newton step for f = E f_n(θ) = E[l(y_n, ⟨θ, Φ(x_n)⟩)] at θ̃ is equivalent to minimizing the quadratic approximation
g(θ) = f(θ̃) + ⟨f′(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f″(θ̃)(θ − θ̃)⟩
= f(θ̃) + ⟨E f′_n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, E f″_n(θ̃)(θ − θ̃)⟩
= E[f(θ̃) + ⟨f′_n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f″_n(θ̃)(θ − θ̃)⟩]
111 Restoring convergence through online Newton steps The Newton step for f = E f_n(θ) = E[l(y_n, ⟨θ, Φ(x_n)⟩)] at θ̃ is equivalent to minimizing the quadratic approximation
g(θ) = f(θ̃) + ⟨f′(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f″(θ̃)(θ − θ̃)⟩
= f(θ̃) + ⟨E f′_n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, E f″_n(θ̃)(θ − θ̃)⟩
= E[f(θ̃) + ⟨f′_n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f″_n(θ̃)(θ − θ̃)⟩]
Complexity of the least-mean-square recursion for g is O(p): θ_n = θ_{n−1} − γ[f′_n(θ̃) + f″_n(θ̃)(θ_{n−1} − θ̃)] f″_n(θ̃) = l″(y_n, ⟨θ̃, Φ(x_n)⟩) Φ(x_n) ⊗ Φ(x_n) has rank one New online Newton step without computing/inverting Hessians
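To make the rank-one structure concrete, here is a sketch of this recursion for logistic regression (my own construction following the formula above; the function names, constants, and usage example are assumptions, not the slides'): around a support point θ̃, each update only needs the first and second derivatives of the loss at ⟨θ̃, Φ(x_n)⟩, so the cost stays O(p) per iteration.

```python
import numpy as np

def online_newton_lms(Phi, y, theta_tilde, gamma):
    """Averaged LMS recursion on the quadratic approximation of the logistic risk around theta_tilde:
    theta_n = theta_{n-1} - gamma [ f_n'(theta_tilde) + f_n''(theta_tilde)(theta_{n-1} - theta_tilde) ],
    exploiting the rank-one structure of f_n''(theta_tilde)."""
    theta = theta_tilde.copy()                     # start the recursion at the support point
    theta_bar = np.zeros_like(theta)
    for k in range(Phi.shape[0]):
        phi, yk = Phi[k], y[k]
        u = yk * (phi @ theta_tilde)
        sigma = 1.0 / (1.0 + np.exp(u))
        grad_tilde = -yk * sigma * phi                                       # f_n'(theta_tilde)
        curvature = sigma * (1.0 - sigma)                                    # l''(<theta_tilde, Phi(x_n)>)
        hess_times_diff = curvature * (phi @ (theta - theta_tilde)) * phi    # rank-one Hessian-vector product
        theta = theta - gamma * (grad_tilde + hess_times_diff)
        theta_bar += (theta - theta_bar) / (k + 1)                           # averaged iterate
    return theta_bar

# Usage example with a synthetic logistic model; theta_tilde would normally come from a
# first pass of averaged SGD (here it is simply set to zero for illustration).
rng = np.random.default_rng(0)
p, n = 5, 20000
theta_star = rng.standard_normal(p)
Phi = rng.standard_normal((n, p))
y = np.where(rng.random(n) < 1 / (1 + np.exp(-Phi @ theta_star)), 1.0, -1.0)
print("averaged iterate:", online_newton_lms(Phi, y, np.zeros(p), gamma=0.1))
```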
112 Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity
113 Logistic regression - Proof technique Using generalized self-concordance of φ : u ↦ log(1 + e^{−u}): |φ‴(u)| ≤ φ″(u) NB: difference with regular self-concordance: |φ‴(u)| ≤ 2φ″(u)^{3/2} Using novel high-probability convergence results for regular averaged stochastic gradient descent Requires an assumption on the kurtosis in every direction, i.e., E⟨Φ(x_n), η⟩⁴ ≤ κ [E⟨Φ(x_n), η⟩²]²
114 Choice of support point for online Newton step Two-stage procedure: (1) Run n/2 iterations of averaged SGD to obtain θ̃ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity Update at each iteration using the current averaged iterate Recursion: θ_n = θ_{n−1} − γ[f′_n(θ̄_{n−1}) + f″_n(θ̄_{n−1})(θ_{n−1} − θ̄_{n−1})] No provable convergence rate (yet) but best practical behavior Note the (dis)similarity with regular SGD: θ_n = θ_{n−1} − γ f′_n(θ_{n−1})
115 Simulations - synthetic examples Gaussian distributions - p = 20 (Figure: log₁₀[f(θ) − f(θ*)] vs. log₁₀(n) for the logistic loss on synthetic data, comparing step-sizes 1/(2R²), 1/(8R²), 1/(32R²), 1/(2R²√n) and the online Newton variants: support point updated every 2^p iterations, every iteration, 2-step, and 2-step doubling.)
116 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) (Figure: log₁₀[f(θ) − f(θ*)] on the test set vs. log₁₀(n) for the logistic loss on the alpha and news data sets, comparing step-sizes 1/R² and 1/(R²√n) (resp. C/R² and C/(R²√n) for C = opt) with SAG, Adagrad and Newton.)
117 Optimal bounds for least-squares? Least-squares: cannot beat σ²d/n (Tsybakov, 2003). Really? Refined assumptions with adaptivity (Dieuleveut and Bach, 2014):
f(θ̄_n) − f(θ*) ≤ 16σ² tr Σ^{1/α} (γn)^{1/α}/n + 4‖H^{1/2−r}(θ_0 − θ*)‖²/(γ^{2r} n^{2 min{r,1}})
Extension to non-parametric estimation Previous results: α = 0 and r = 1/2 Asymptotic analysis (Défossez and Bach, 2015): bias vs. variance, which one is dominating? Acceleration (Flammarion and Bach, 2015)
118 Bias-variance decomposition (Défossez and Bach, 2015) Simplification: dominating term when n → +∞ and γ → 0 Variance (e.g., starting from the solution): f(θ̄_n) − f(θ*) ≅ (1/n) E[ε² Φ(x)⊤ H^{−1} Φ(x)] NB: if the noise ε is independent, then we obtain dσ²/n Exponentially decaying remainder terms (strongly convex problems) Bias (e.g., no noise): f(θ̄_n) − f(θ*) ≅ (1/(n²γ²)) (θ_0 − θ*)⊤ H^{−1} (θ_0 − θ*)
119 Bias-variance decomposition (Figure: f(θ̄_n) − f(θ*) vs. iteration n, showing the bias and variance terms for γ = γ_0 and γ = γ_0/10, with a reference slope of −2.)
120 Acceleration (Flammarion and Bach, 2015) Existing results (Bach and Moulines, 2013): Variance = σ²d/n Bias ≤ min{R²‖θ_0 − θ*‖²/n, R⁴⟨θ_0 − θ*, H^{−1}(θ_0 − θ*)⟩/n²} Is it possible to get a bias term in R²‖θ_0 − θ*‖²/n²? Corresponds to acceleration (Nesterov, 1983) Best (current) result: σ²d/n^{1−α} + R²‖θ_0 − θ*‖²/n^{1+α}
121 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
122 Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y,θ Φ(x))
123 Going beyond a single pass over the data Stochastic approximation: Assumes an infinite data stream Observations are used only once Directly minimizes the testing cost E_{(x,y)} l(y, θ⊤Φ(x)) Machine learning practice: Finite data set (x₁, y₁, ..., x_n, y_n) Multiple passes Minimizes the training cost (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) Need to regularize (e.g., by the l₂-norm) to avoid overfitting Goal: minimize g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
124 Stochastic vs. deterministic methods Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = l(y_i, θ⊤Φ(x_i)) + µΩ(θ) Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f′_i(θ_{t−1}) Linear (e.g., exponential) convergence rate in O(e^{−αt}) Iteration complexity is linear in n (with line search)
126 Stochastic vs. deterministic methods Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = l(y_i, θ⊤Φ(x_i)) + µΩ(θ) Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f′_i(θ_{t−1}) Linear (e.g., exponential) convergence rate in O(e^{−αt}); iteration complexity is linear in n (with line search) Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1}) Sampling with replacement: i(t) random element of {1,...,n} Convergence rate in O(1/t); iteration complexity is independent of n (step size selection?)
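A sketch contrasting the two recursions on a small regularized logistic-regression problem (the data, constants, and function names are mine, not the slides'); each batch step touches all n examples, while each stochastic step touches a single example sampled with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 1000, 20, 0.1
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))

def grad_i(theta, i):
    """Gradient of f_i(theta) = logistic loss on example i + (mu/2)||theta||_2^2."""
    s = 1.0 / (1.0 + np.exp(y[i] * (X[i] @ theta)))
    return -y[i] * s * X[i] + mu * theta

def g(theta):
    """Full regularized empirical risk."""
    return np.mean(np.logaddexp(0.0, -y * (X @ theta))) + 0.5 * mu * theta @ theta

L = 0.25 * np.linalg.eigvalsh(X.T @ X / n).max() + mu    # smoothness constant of g

# Batch gradient descent: every iteration touches all n examples.
theta_batch = np.zeros(d)
for t in range(50):
    theta_batch -= np.mean([grad_i(theta_batch, i) for i in range(n)], axis=0) / L

# Stochastic gradient descent: one example per iteration, sampled with replacement.
theta_sgd = np.zeros(d)
for t in range(1, 50 * n + 1):
    i = rng.integers(n)
    theta_sgd -= grad_i(theta_sgd, i) / (mu * (t + 1))   # step size ~ 1/(mu t) for the strongly convex sum

print("batch objective     :", g(theta_batch))
print("stochastic objective:", g(theta_sgd))
```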
128 Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(1) iteration cost, robustness to step size (Figure: log(excess cost) vs. time for a stochastic and a deterministic method.)
129 Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(1) iteration cost, robustness to step size (Figure: log(excess cost) vs. time for a stochastic, a deterministic, and a hybrid method.)
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationStochastic Gradient Descent. Ryan Tibshirani Convex Optimization
Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationErgodic Subgradient Descent
Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationOn Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression
On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem
More informationStochastic Composition Optimization
Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationConvergence Rates of Kernel Quadrature Rules
Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction
More informationEstimators based on non-convex programs: Statistical and computational guarantees
Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationOptimistic Rates Nati Srebro
Optimistic Rates Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationBits of Machine Learning Part 1: Supervised Learning
Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationONLINE VARIANCE-REDUCING OPTIMIZATION
ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com
More informationStatistical Sparse Online Regression: A Diffusion Approximation Perspective
Statistical Sparse Online Regression: A Diffusion Approximation Perspective Jianqing Fan Wenyan Gong Chris Junchi Li Qiang Sun Princeton University Princeton University Princeton University University
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationOptimal Regularized Dual Averaging Methods for Stochastic Optimization
Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business
More informationStochastic Optimization: First order method
Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationWhy should you care about the solution strategies?
Optimization Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms makes you more effectively choose which algorithm to run Understanding the
More informationRandomized Smoothing for Stochastic Optimization
Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and
More informationConstrained Stochastic Gradient Descent for Large-scale Least Squares Problem
Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem ABSRAC Yang Mu University of Massachusetts Boston 100 Morrissey Boulevard Boston, MA, US 015 yangmu@cs.umb.edu ianyi Zhou University
More informationNon-asymptotic bound for stochastic averaging
Non-asymptotic bound for stochastic averaging S. Gadat and F. Panloup Toulouse School of Economics PGMO Days, November, 14, 2017 I - Introduction I - 1 Motivations I - 2 Optimization I - 3 Stochastic Optimization
More informationTutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer
Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,
More informationSummary and discussion of: Dropout Training as Adaptive Regularization
Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationModern Optimization Techniques
Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationAccelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization
Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationOptimized first-order minimization methods
Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationMore Optimization. Optimization Methods. Methods
More More Optimization Optimization Methods Methods Yann YannLeCun LeCun Courant CourantInstitute Institute http://yann.lecun.com http://yann.lecun.com (almost) (almost) everything everything you've you've
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationAccelerate Subgradient Methods
Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationSimultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms
More informationMinimizing finite sums with the stochastic average gradient
Math. Program., Ser. A (2017) 162:83 112 DOI 10.1007/s10107-016-1030-6 FULL LENGTH PAPER Minimizing finite sums with the stochastic average gradient Mark Schmidt 1 Nicolas Le Roux 2 Francis Bach 3 Received:
More informationMirror Descent for Metric Learning. Gautam Kunapuli Jude W. Shavlik
Mirror Descent for Metric Learning Gautam Kunapuli Jude W. Shavlik And what do we have here? We have a metric learning algorithm that uses composite mirror descent (COMID): Unifying framework for metric
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More information