Statistical machine learning and convex optimization

Size: px

Start display at page:

Download "Statistical machine learning and convex optimization"

Chad Horn
6 years ago
Views:

1 Statistical machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Mastère M2 - Paris-Sud - Spring 2018 Slides available: 1

2 Statistical machine learning and convex optimization Six classes (lecture notes and slides online), always in 0A2 1. Thursday February 01, 2pm to 5pm 2. Thursday February 08, 2pm to 5pm 3. Thursday February 15, 2pm to 5pm 4. Thursday March 22, 2pm to 5pm 5. Thursday March 29, 2pm to 5pm 6. Thursday April 05, 2pm to 5pm Evaluation 1. Studying research papers with three-page report (see instructions on web site - due April 19) 2. Basic implementations (Matlab / Python / R) 3. Attending 4 out of 6 classes is mandatory Register online 2

3 Big data revolution? A new scientific context Data everywhere: size does not (always) matter Science and industry Size and variety Learning from examples n observations in dimension d 3

4 Search engines - Advertising 4

5 Search engines - Advertising 5

6 Advertising 6

7 Marketing - Personalized recommendation 7

8 Visual object recognition 8

9 Bioinformatics Protein: Crucial elements of cell life Massive data: 2 millions for humans Complex data 9

10 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951b) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent 10

11 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951b) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent 11

12 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951b) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent 12

13 Scaling to large problems Retour aux sources 1950 s: Computers not powerful enough IBM 1620, 1959 CPU frequency: 50 KHz Price > dollars 2010 s: Data too massive - Stochastic gradient methods (Robbins and Monro, 1951a) - Going back to simple methods 13

14 Scaling to large problems Retour aux sources 1950 s: Computers not powerful enough IBM 1620, 1959 CPU frequency: 50 KHz Price > dollars 2010 s: Data too massive Stochastic gradient methods (Robbins and Monro, 1951a) Going back to simple methods 14

15 1. Introduction Outline - II Large-scale machine learning and optimization Classes of functions (convex, smooth, etc.) Traditional statistical analysis through Rademacher complexity 2. Classical methods for convex optimization Smooth optimization (gradient descent, Newton method) Non-smooth optimization (subgradient descent) Proximal methods 3. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 15

16 Outline - II 4. Classical stochastic approximation Asymptotic analysis Robbins-Monro algorithm Polyak-Rupert averaging 5. Smooth stochastic approximation algorithms Non-asymptotic analysis for smooth functions Logistic regression Least-squares regression without decaying step-sizes 6. Finite data sets Gradient methods with exponential convergence rates Convex duality (Dual) stochastic coordinate descent - Frank-Wolfe 16

17 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer 17

18 Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2 18

19 Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2 Classification : y { 1,1}, prediction ŷ = sign(θ Φ(x)) loss of the form l(yθ Φ(x)) True 0-1 loss: l(yθ Φ(x)) = 1 yθ Φ(x)<0 Usual convex losses: hinge square logistic

20 Main motivating examples Support vector machine (hinge loss): non-smooth l(y,θ Φ(X)) = max{1 Yθ Φ(X),0} Logistic regression: smooth Least-squares regression l(y,θ Φ(X)) = log(1+exp( Yθ Φ(X))) l(y,θ Φ(X)) = 1 2 (Y θ Φ(X)) 2 Structured output regression See Tsochantaridis et al. (2005); Lacoste-Julien et al. (2013) 20

21 Usual regularizers Main goal: avoid overfitting (squared) Euclidean norm: θ 2 2 = d j=1 θ j 2 Numerically well-behaved Representer theorem and kernel methods : θ = n i=1 α iφ(x i ) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) 21

22 Usual regularizers Main goal: avoid overfitting (squared) Euclidean norm: θ 2 2 = d j=1 θ j 2 Numerically well-behaved Representer theorem and kernel methods : θ = n i=1 α iφ(x i ) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) Sparsity-inducing norms Main example: l 1 -norm θ 1 = d j=1 θ j Perform model selection as well as regularization Non-smooth optimization and structured sparsity See, e.g., Bach, Jenatton, Mairal, and Obozinski (2012b,a) 22

23 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer 23

24 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously 24

25 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously 25

26 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) such that Ω(θ) D θ R d n i=1 convex data fitting term + constraint Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously 26

27 General assumptions Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Bounded features Φ(x) R d : Φ(x) 2 R Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Loss for a single observation: f i (θ) = l(y i,θ Φ(x i )) i, f(θ) = Ef i (θ) Properties of f i,f, ˆf Convex on R d Additional regularity assumptions: Lipschitz-continuity, smoothness and strong convexity 27

28 Convexity Global definitions g(θ) θ 28

29 Convexity Global definitions (full domain) g(θ) g(θ 2 ) g(θ 1 ) Not assuming differentiability: θ 2 θ 1 θ θ 1,θ 2,α [0,1], g(αθ 1 +(1 α)θ 2 ) αg(θ 1 )+(1 α)g(θ 2 ) 29

30 Convexity Global definitions (full domain) g(θ) g(θ 2 ) g(θ 1 ) Assuming differentiability: θ 2 θ 1 θ θ 1,θ 2, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 ) Extensions to all functions with subgradients / subdifferential 30

31 Subgradients and subdifferentials Given g : R d R convex g(θ ) g(θ) θ g(θ)+s (θ θ) s R d is a subgradient of g at θ if and only if θ R d,g(θ ) g(θ)+s (θ θ) Subdifferential g(θ) = set of all subgradients at θ If g is differentiable at θ, then g(θ) = {g (θ)} Example: absolute value The subdifferential is never empty! See Rockafellar (1997) θ 31

32 Convexity Global definitions (full domain) g(θ) θ Local definitions Twice differentiable functions θ, g (θ) 0 (positive semi-definite Hessians) 32

33 Convexity Global definitions (full domain) g(θ) θ Local definitions Twice differentiable functions θ, g (θ) 0 (positive semi-definite Hessians) Why convexity? 33

34 Why convexity? Local minimum = global minimum Optimality condition (non-smooth): 0 g(θ) Optimality condition (smooth): g (θ) = 0 Convex duality See Boyd and Vandenberghe (2003) Recognizing convex problems See Boyd and Vandenberghe (2003) 34

35 Lipschitz continuity Bounded gradients of g ( Lipschitz-continuity): the function g if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: θ R d, θ 2 D g (θ) 2 B θ,θ R d, θ 2, θ 2 D g(θ) g(θ ) B θ θ 2 Machine learning with g(θ) = 1 n n i=1 l(y i,θ Φ(x i )) G-Lipschitz loss and R-bounded data: B = GR 35

36 Smoothness and strong convexity A function g : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, g (θ 1 ) g (θ 2 ) 2 L θ 1 θ 2 2 If g is twice differentiable: θ R d, g (θ) L Id smooth non smooth 36

37 Smoothness and strong convexity A function g : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, g (θ 1 ) g (θ 2 ) 2 L θ 1 θ 2 2 If g is twice differentiable: θ R d, g (θ) L Id Machine learning with g(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) L loss -smooth loss and R-bounded data: L = L loss R 2 37

38 Smoothness and strong convexity A function g : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If g is twice differentiable: θ R d, g (θ) µ Id convex strongly convex 38

39 Smoothness and strong convexity A function g : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If g is twice differentiable: θ R d, g (θ) µ Id (large µ/l) (small µ/l) 39

40 Smoothness and strong convexity A function g : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If g is twice differentiable: θ R d, g (θ) µ Id Machine learning with g(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension) 40

41 Smoothness and strong convexity A function g : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If g is twice differentiable: θ R d, g (θ) µ Id Machine learning with g(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension) Adding regularization by µ 2 θ 2 creates additional bias unless µ is small 41

42 Summary of smoothness/convexity assumptions Bounded gradients of g (Lipschitz-continuity): the function g if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: θ R d, θ 2 D g (θ) 2 B Smoothness of g: the function g is convex, differentiable with L-Lipschitz-continuous gradient g (e.g., bounded Hessians): θ 1,θ 2 R d, g (θ 1 ) g (θ 2 ) 2 L θ 1 θ 2 2 Strong convexity of g: The function g is strongly convex with respect to the norm, with convexity constant µ > 0: θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ

43 Analysis of empirical risk minimization Approximation and estimation errors: Θ = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ Θ θ Θ θ R df(θ) Estimation error Approximation error NB: may replace min by best (non-linear) predictions θ Rdf(θ) 43

44 Analysis of empirical risk minimization Approximation and estimation errors: Θ = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ Θ θ Θ θ R df(θ) Estimation error Approximation error 1. Uniform deviation bounds, with ˆθ argmin θ Θ ˆf(θ) f(ˆθ) min θ Θ f(θ) = [ f(ˆθ) ˆf(ˆθ) ] + [ˆf(ˆθ) ˆf((θ ) Θ ) ] + [ˆf((θ ) Θ ) f((θ supf(θ) 0 +sup θ Θ θ Θ ˆf(θ) f(θ) 44

45 Analysis of empirical risk minimization Approximation and estimation errors: Θ = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ Θ θ Θ θ R df(θ) Estimation error Approximation error 1. Uniform deviation bounds, with ˆθ argmin θ Θ ˆf(θ) f(ˆθ) min θ Θ Typically slow rate O ( 1/ n ) f(θ) supf(θ) ˆf(θ)+sup θ Θ θ Θ ˆf(θ) f(θ) 2. More refined concentration results with faster rates O(1/n) 45

46 Analysis of empirical risk minimization Approximation and estimation errors: Θ = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ Θ θ Θ θ R df(θ) Estimation error Approximation error 1. Uniform deviation bounds, with ˆθ argmin θ Θ ˆf(θ) f(ˆθ) min θ Θ Typically slow rate O ( 1/ n ) f(θ) 2 sup f(θ) ˆf(θ) θ Θ 2. More refined concentration results with faster rates O(1/n) 46

47 Motivation from least-squares For least-squares, we have l(y,θ Φ(x)) = 1 2 (y θ Φ(x)) 2, and sup θ 2 D ˆf(θ) f(θ) = 1 ( 1 n 2 θ Φ(x i )Φ(x i ) EΦ(X)Φ(X) )θ n i=1 ( 1 n ) θ y i Φ(x i ) EYΦ(X) + 1 ( 1 n yi 2 EY ), 2 n 2 n i=1 i=1 1 n i )Φ(x i ) n i=1φ(x EΦ(X)Φ(X) op +D 1 n y i Φ(x i ) EYΦ(X) n n 2 2 yi 2 EY 2, n f(θ) ˆf(θ) D2 2 i=1 sup f(θ) ˆf(θ) O(1/ n) with high probability from 3 concentrations θ 2 D i=1 47

48 Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on Θ = { θ 2 D} No assumptions regarding convexity 48

49 Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on Θ = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup ˆf(θ) f(θ) l 0+GRD [2+ θ Θ n 2log 2 δ Expectated estimation error: E [ sup ˆf(θ) f(θ) ] 4l 0+4GRD θ Θ n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate ] 49

50 Symmetrization with Rademacher variables Let D = {x 1,y 1,...,x n,y n} an independent copy of the data D = {x 1,y 1,...,x n,y n }, with corresponding loss functions f i (θ) E [ supf(θ) ˆf(θ) ] = E [ sup θ Θ θ Θ [ = E [ E E sup θ Θ [ [ = E sup θ Θ [ = E sup [ 2E θ Θ ( f(θ) 1 n n 1 n sup θ Θ 1 n 1 n sup θ Θ 1 n i=1 n i=1 ) ] f i (θ) E ( f i(θ) f i (θ) D ) ( f i (θ) f i (θ) ) ]] D 1 n n i=1 n ( f i (θ) f i (θ) )] i=1 n ( ε i f i (θ) f i (θ) )] with ε i uniform in { 1,1} i=1 n i=1 ] ε i f i (θ) = Rademacher complexity 50

51 Rademacher complexity Rademacher complexity of the class of functions (X,Y) l(y,θ Φ(X)) [ n ] R n = E ε i f i (θ) sup θ Θ 1 n i=1 with f i (θ) = l(x i,θ Φ(x i )), (x i,y i ), i.i.d NB 1: two expectations, with respect to D and with respect to ε Empirical Rademacher average ˆR n by conditioning on D 1 n NB 2: sometimes defined as sup θ Θ n i=1 ε if i (θ) Main property: E [ supf(θ) ˆf(θ) ] and E [ sup θ Θ θ Θ ˆf(θ) f(θ) ] 2R n 51

52 From Rademacher complexity to uniform bound Let Z = sup θ Θ f(θ) ˆf(θ) By changing the pair (x i,y i ), Z may only change by 2 n sup l(y,θ Φ(X)) 2 ( ) 2( sup l(y,0) +GRD l0 +GRD ) = c n n with sup l(y,0) = l 0 MacDiarmid inequality: with probability greater than 1 δ, Z EZ + n 2 c log 1 δ 2R n+ 2 n ( l0 +GRD ) log 1 δ 52

53 Bounding the Rademacher average - I Wehave, withϕ i (u) = l(y i,u) l(y i,0)isalmostsurelyg-lipschitz: ˆR n = E ε [sup θ Θ E ε [sup θ Θ 1 n 1 n n i=1 n i=1 0+E ε [sup θ Θ = 0+E ε [sup θ Θ 1 n 1 n ] ε i f i (θ) [ ε i f i (0) ]+E ε sup θ Θ 1 n n [ ε i fi (θ) f i (0) ]] i=1 n i=1 ] ε i ϕ i (θ Φ(x i )) n [ ε i fi (θ) f i (0) ]] Using Ledoux-Talagrand contraction results for Rademacher averages (since ϕ i is G-Lipschitz), we get (Meir and Zhang, 2003): ˆR n 2G E ε [ sup θ 2 D 1 n n i=1 i=1 ] ε i θ Φ(x i ) 53

54 Proof of Ledoux-Talagrand lemma (Meir and Zhang, 2003, Lemma 5) Given any b, a i : Θ R (no assumption) and ϕ i : R R any 1-Lipschitz-functions, i = 1,...,n E ε [supb(θ)+ θ Θ n i=1 Proof by induction on n n = 0: trivial From n to n+1: see next slide ] [ ε i ϕ i (a i (θ)) E ε supb(θ)+ θ Θ n i=1 ] ε i a i (θ) 54

55 From n to n+1 E ε1,...,ε n+1 [sup θ Θ = E ε1,...,εn = E ε1,...,εn E ε1,...,εn [ [ [ sup θ,θ Θ sup θ,θ Θ sup θ,θ Θ = E ε1,...,εne εn+1 [sup θ Θ E ε1,...,εn,ε n+1 [sup θ Θ b(θ) + n+1 i=1 b(θ) + b(θ ) 2 b(θ) + b(θ ) 2 b(θ) + b(θ ) 2 ] ε i ϕ i (a i (θ)) n i=1 n i=1 n i=1 b(θ) + ε n+1 a n+1 (θ) + b(θ) + ε n+1 a n+1 (θ) + ε i ϕ i (a i (θ)) + ϕ i (a i (θ )) 2 ε i ϕ i (a i (θ)) + ϕ i (a i (θ )) 2 ε i ϕ i (a i (θ)) + ϕ i (a i (θ )) 2 n i=1 ] ε i ϕ i (a i (θ)) n ] ε i a i (θ) by recursion i=1 + ϕ n+1(a n+1 (θ)) ϕ n+1 (a n+1 (θ 2 + ϕ n+1(a n+1 (θ)) ϕ n+1 (a n+1 (θ 2 + a n+1(θ) a n+1 (θ ] ) 2 55

56 We have: Bounding the Rademacher average - II [ R n 2GE = 2GE D1 n 2GD sup θ 2 D n i=1 E 1 n 1 n n i=1 ε i Φ(x i ) ] ε i θ Φ(x i ) 2 n ε i Φ(x i ) i=1 2 2 by Jensen s inequality 2GRD n by using Φ(x) 2 R and independence Overall, we get, with probability 1 δ: sup f(θ) ˆf(θ) 1 ( l0 +GRD)(4+ θ Θ n 2log 1 δ ) 56

57 We have, with probability 1 δ Putting it all together For exact minimizer ˆθ argmin θ Θ ˆf(θ), we have f(ˆθ) min θ Θ f(θ) sup θ Θ For inexact minimizer η Θ f(η) min θ Θ ˆf(θ) f(θ)+sup θ Θ 2 n ( l0 +GRD)(4+ f(θ) ˆf(θ) 2log 1 δ f(θ) 2 sup ˆf(θ) f(θ) + [ˆf(η) ] ˆf(ˆθ) θ Θ Only need to optimize with precision 2 n (l 0 +GRD) ) 57

58 We have, with probability 1 δ Putting it all together For exact minimizer ˆθ argmin θ Θ ˆf(θ), we have f(ˆθ) min θ Θ For inexact minimizer η Θ f(η) min θ Θ f(θ) 2 sup ˆf(θ) f(θ) θ Θ 2 n ( l0 +GRD)(4+ 2log 1 δ f(θ) 2 sup ˆf(θ) f(θ) + [ˆf(η) ] ˆf(ˆθ) θ Θ Only need to optimize with precision 2 n (l 0 +GRD) ) 58

59 Slow rate for supervised learning (summary) Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on Θ = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup ˆf(θ) f(θ) (l 0+GRD) [2+ θ Θ n 2log 2 δ Expectated estimation error: E [ sup ˆf(θ) f(θ) ] 4(l 0+GRD) θ Θ n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate ] 59

60 Motivation from mean estimation Estimator ˆθ = 1 n n i=1 z i = argmin θ R 1 2n n i=1 (θ z i) 2 = ˆf(θ) From before: f(θ) = 1 2 E(θ z)2 = 1 2 (θ Ez) var(z) = ˆf(θ)+O(1/ n) f(ˆθ) = 1 2 (ˆθ Ez) var(z) = f(ez)+o(1/ n) 60

61 Motivation from mean estimation Estimator ˆθ = 1 n n i=1 z i = argmin θ R 1 2n n i=1 (θ z i) 2 = ˆf(θ) θ = Ez = argmin θ R 1 2 E(θ z)2 = f(θ) From before (estimation error): f(ˆθ) f(θ ) = O(1/ n) Direct computation: f(θ) = 1 2 E(θ z)2 = 1 2 (θ Ez) var(z) More refined/direct bound: f(ˆθ) f(ez) = 1 2 (ˆθ Ez) 2 E [ f(ˆθ) f(ez) ] = 1 2 E ( 1 n n i=1 ) 2 z i Ez = 1 2n var(z) Bound only at ˆθ + strong convexity (instead of uniform bound) 61

62 Fast rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Same as before (bounded features, Lipschitz loss) Regularized risks: f µ (θ) = f(θ)+ µ 2 θ 2 2 and ˆf µ (θ) = ˆf(θ)+ µ 2 θ 2 2 Convexity For any a > 0, with probability greater than 1 δ, for all θ R d, f µ (ˆθ) min (η) 8G2 R 2 (32+log 1 δ ) η R dfµ µn Results from Sridharan, Srebro, and Shalev-Shwartz (2008) see also Boucheron and Massart (2011) and references therein Strongly convex functions fast rate Warning: µ should decrease with n to reduce approximation error 62

63 1. Introduction Outline - II Large-scale machine learning and optimization Classes of functions (convex, smooth, etc.) Traditional statistical analysis through Rademacher complexity 2. Classical methods for convex optimization Smooth optimization (gradient descent, Newton method) Non-smooth optimization (subgradient descent) Proximal methods 3. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 63

64 Outline - II 4. Classical stochastic approximation Asymptotic analysis Robbins-Monro algorithm Polyak-Rupert averaging 5. Smooth stochastic approximation algorithms Non-asymptotic analysis for smooth functions Logistic regression Least-squares regression without decaying step-sizes 6. Finite data sets Gradient methods with exponential convergence rates Convex duality (Dual) stochastic coordinate descent - Frank-Wolfe 64

65 Complexity results in convex optimization Assumption: g convex on R d Classical generic algorithms Gradient descent and accelerated gradient descent Newton method Subgradient method and ellipsoid algorithm 65

66 Complexity results in convex optimization Assumption: g convex on R d Classical generic algorithms Gradient descent and accelerated gradient descent Newton method Subgradient method and ellipsoid algorithm Key additional properties of g Lipschitz continuity, smoothness or strong convexity Key insight from Bottou and Bousquet (2008) In machine learning, no need to optimize below estimation error Key references: Nesterov (2004), Bubeck (2015) 66

67 Several criteria for characterizing convergence Objective function values Usually weaker condition Iterates g(θ) inf η R dg(η) inf η argming θ η 2 Typically used for strongly-convex problems NB 1: relationships between the two types in several situations (see later) NB 2: similarity with prediction vs. estimation in statistics 67

68 (smooth) gradient descent Assumptions g convex with L-Lipschitz-continuous gradient (e.g., L-smooth) Minimum attained at θ Algorithm: θ t = θ t 1 1 L g (θ t 1 ) 68

69 (smooth) gradient descent - strong convexity Assumptions g convex with L-Lipschitz-continuous gradient (e.g., L-smooth) g µ-strongly convex Algorithm: Bound: θ t = θ t 1 1 L g (θ t 1 ) g(θ t ) g(θ ) (1 µ/l) t[ g(θ 0 ) g(θ ) ] Three-line proof Line search, steepest descent or constant step-size 69

70 Assumptions (smooth) gradient descent - slow rate g convex with L-Lipschitz-continuous gradient (e.g., L-smooth) Minimum attained at θ Algorithm: Bound: Four-line proof θ t = θ t 1 1 L g (θ t 1 ) g(θ t ) g(θ ) 2L θ 0 θ 2 t+4 Adaptivity of gradient descent to problem difficulty Not best possible convergence rates after O(d) iterations 70

71 Gradient descent - Proof for quadratic functions Quadratic convex function: g(θ) = 1 2 θ Hθ c θ µ and L are smallest largest eigenvalues of H Global optimum θ = H 1 c (or H c) Gradient descent: θ t = θ t 1 1 L (Hθ t 1 c) = θ t 1 1 L (Hθ t 1 Hθ ) θ t θ = (I 1 L H)(θ t 1 θ ) = (I 1 L H)t (θ 0 θ ) Strong convexity µ > 0: eigenvalues of (I 1 L H)t in [0,(1 µ L )t ] Convergence of iterates: θ t θ 2 (1 µ/l) 2t θ 0 θ 2 Function values: g(θ t ) g(θ ) (1 µ/l) 2t[ g(θ 0 ) g(θ ) ] Function values: g(θ t ) g(θ ) L t θ 0 θ 2 71

72 Gradient descent - Proof for quadratic functions Quadratic convex function: g(θ) = 1 2 θ Hθ c θ µ and L are smallest largest eigenvalues of H Global optimum θ = H 1 c (or H c) Gradient descent: θ t = θ t 1 1 L (Hθ t 1 c) = θ t 1 1 L (Hθ t 1 Hθ ) θ t θ = (I 1 L H)(θ t 1 θ ) = (I 1 L H)t (θ 0 θ ) Convexity µ = 0: eigenvalues of (I 1 L H)t in [0,1] No convergence of iterates: θ t θ 2 θ 0 θ 2 Function values: g(θ t ) g(θ ) max v [0,L] v(1 v/l) 2t θ 0 θ 2 Function values: g(θ t ) g(θ ) L t θ 0 θ 2 72

73 Properties of smooth convex functions Let g : R d R a convex L-smooth function. Then for all θ,η R d : Definition: g (θ) g (η) L θ η If twice differentiable: 0 g (θ) LI Quadratic upper-bound: 0 g(θ) g(η) g (η) (θ η) L 2 θ η 2 Taylor expansion with integral remainder Co-coercivity: 1 L g (θ) g (η) 2 [ g (θ) g (η) ] (θ η) If g is µ-strongly convex (no need for smoothness), then g(θ) g(η)+g (η) (θ η)+ 1 2µ g (θ) g (η) 2 Distance to optimum: g(θ) g(θ ) g (θ) (θ θ ) 73

74 Proof of co-coercivity Quadratic upper-bound: 0 g(θ) g(η) g (η) (θ η) L 2 θ η 2 Taylor expansion with integral remainder Lower bound: g(θ) g(η)+g (η) (θ η)+ 1 2L g (θ) g (η) 2 Define h(θ) = g(θ) θ g (η), convex with global minimum at η h(η) h(θ 1 L h (θ)) h(θ)+h (θ) ( 1 L h (θ)))+ L 2 1 L h (θ)) 2, which is thus less than h(θ) 1 2L h (θ) 2 Thus g(η) η g (η) g(θ) θ g (η) 1 2L g (θ) g (η) 2 Proof of co-coercivity Apply lower bound twice for (η,θ) and (θ,η), and sum to get 0 [g (η) g (θ)] (θ η)+ 1 L g (θ) g (η) 2 NB: simple proofs with second-order derivatives 74

75 Proof of g(θ) g(η)+g (η) (θ η)+ 1 2µ g (θ) g (η) 2 Define h(θ) = g(θ) θ g (η), convex with global minimum at η h(η) = min θ h(θ) min ζ h(θ)+h (θ) (ζ θ)+ µ 2 ζ θ 2, which is attained for ζ θ = 1 µ h (θ) This leads to h(η) h(θ) 1 2µ h (θ) 2 Hence, g(η) η g (η) g(θ) θ g (η) 1 2µ g (η) (θ) 2 NB: no need for smooothness NB: simple proofs with second-order derivatives With η = θ global minimizer, another distance to optimum g(θ) g(θ ) 1 2µ g (θ) 2 Polyak-Lojasiewicz 75

76 Convergence proof - gradient descent smooth strongly convex functions Iteration: θ t = θ t 1 γg (θ t 1 ) with γ = 1/L g(θ t ) = g [ θ t 1 γg (θ t 1 ) ] g(θ t 1 )+g (θ t 1 ) [ γg (θ t 1 ) ] + L 2 γg (θ t 1 ) 2 = g(θ t 1 ) γ(1 γl/2) g (θ t 1 ) 2 = g(θ t 1 ) 1 2L g (θ t 1 ) 2 if γ = 1/L, g(θ t 1 ) µ L[ g(θt 1 ) g(θ ) ] using strongly-convex distance to optimum Thus, g(θ t ) g(θ ) (1 µ/l) t[ g(θ 0 ) g(θ ) ] May also get (Nesterov, 2004): θ t θ 2 ( 1 2γµL ) t θ0 µ+l θ 2 as soon as γ 2 µ+l 76

77 Convergence proof - gradient descent smooth convex functions - II Iteration: θ t = θ t 1 γg (θ t 1 ) with γ = 1/L θ t θ 2 = θ t 1 θ γg (θ t 1 ) 2 = θ t 1 θ 2 +γ 2 g (θ t 1 ) 2 2γ(θ t 1 θ ) g (θ t 1 ) θ t 1 θ 2 +γ 2 g (θ t 1 ) 2 2 γ L g (θ t 1 ) 2 using co-coercivity = θ t 1 θ 2 γ(2/l γ) g (θ t 1 ) 2 θ t 1 θ 2 if γ 2/L θ 0 θ 2 : bounded iterates g(θ t ) g(θ t 1 ) 1 2L g (θ t 1 ) 2 (see previous slide) g(θ t 1 ) g(θ ) g (θ t 1 ) (θ t 1 θ ) g (θ t 1 ) θ t 1 θ (Cauchy-Schwarz) 1 [ g(θ t ) g(θ ) g(θ t 1 ) g(θ ) g(θt 1 2L θ 0 θ 2 ) g(θ ) ] 2 77

78 Convergence proof - gradient descent smooth convex functions - II Iteration: θ t = θ t 1 γg (θ t 1 ) with γ = 1/L g(θ t ) g(θ ) g(θ t 1 ) g(θ ) 1 2L θ 0 θ 2 [ g(θt 1 ) g(θ ) ] 2 of the form k k 1 α 2 k 1 with 0 k = g(θ k ) g(θ ) L 2 θ k θ 2 1 k 1 1 k 1 1 k α k 1 k by dividing by k k 1 1 k α because ( k ) is non-increasing 1 1 αt by summing from k = 1 to t 0 t 0 t by inverting 1+αt 0 2L θ 0 θ 2 t+4 since 0 L 2 θ k θ 2 and α = 1 2L θ 0 θ 2 78

79 Limits on convergence rate of first-order methods First-order method: any iterative algorithm that selects θ t in θ 0 +span(f (θ 0 ),...,f (θ t 1 )) Problem class: minimizer θ convex L-smooth functions with a global Theorem: for every integer t (d 1)/2 and every θ 0, there exist functions in the problem class such that for any first-order method, g(θ t ) g(θ ) 3 L θ 0 θ 2 32 (t+1) 2 O(1/t) rate for gradient method may not be optimal! 79

80 Limits on convergence rate of first-order methods Proof sketch Define quadratic function g t (θ) = L 8[ (θ 1 ) 2 + t 1 (θ i θ i+1 ) 2 +(θ t ) 2 2θ 1] i=1 Fact 1: g t is L-smooth Fact 2: minimizer supported by first t coordinates (closed form) Fact 3: any first-order method starting from zero will be supported in the first k coordinates after iteration k Fact 4: the minimum over this support in {1,...,k} may be computed in closed form Given iteration k, take g = g 2k+1 and compute lower-bound on g(θ k ) g(θ ) θ 0 θ 2 80

81 Accelerated gradient methods (Nesterov, 1983) Assumptions g convex with L-Lipschitz-cont. gradient, min. attained at θ Algorithm: θ t = η t 1 1 L g (η t 1 ) Bound: η t = θ t + t 1 t+2 (θ t θ t 1 ) g(θ t ) g(θ ) 2L θ 0 θ 2 (t+1) 2 Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011) Not improvable Extension to strongly-convex functions 81

82 Accelerated gradient methods - strong convexity Assumptions g convex with L-Lipschitz-cont. gradient, min. attained at θ g µ-strongly convex Algorithm: θ t = η t 1 1 L g (η t 1 ) η t = θ t + 1 µ/l 1+ µ/l (θ t θ t 1 ) Bound: g(θ t ) f(θ ) L θ 0 θ 2 (1 µ/l) t Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011) Not improvable Relationship with conjugate gradient for quadratic functions 82

83 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2012b) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t) 83

84 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2012b) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t) Problems of the form: min θ R df(θ)+µω(θ) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+µω(θ)+ L 2 θ θ t 2 2 Ω(θ) = θ 1 Thresholded gradient descent Similar convergence rates than smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009) 84

85 Soft-thresholding for the l 1 -norm 1 Example 1: quadratic problem in 1D, i.e. min x R 2 x2 xy +λ x Piecewise quadratic function with a kink at zero Derivative at 0+: g + = λ y and 0 : g = λ y x = 0 is the solution iff g + 0 and g 0 (i.e., y λ) x 0 is the solution iff g + 0 (i.e., y λ) x = y λ x 0 is the solution iff g 0 (i.e., y λ) x = y +λ Solution x = sign(y)( y λ) + = soft thresholding 85

86 Soft-thresholding for the l 1 -norm 1 Example 1: quadratic problem in 1D, i.e. min x R 2 x2 xy +λ x Piecewise quadratic function with a kink at zero Solution x = sign(y)( y λ) + = soft thresholding x x*(y) λ λ y 86

87 Projected gradient descent Problems of the form: min θ K f(θ) θ t+1 = argmin f(θ t)+(θ θ t ) f(θ t )+ L θ K 2 θ θ t θ t+1 = argmin θ ( θ t 1 θ K 2 L f(θ t) ) 2 2 Projected gradient descent Similar convergence rates than smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009) 87

88 Newton method Given θ t 1, minimize second-order Taylor expansion g(θ) = g(θ t 1 )+g (θ t 1 ) (θ θ t 1 )+ 1 2 (θ θ t 1) g (θ t 1 ) (θ θ t 1 ) Expensive Iteration: θ t = θ t 1 g (θ t 1 ) 1 g (θ t 1 ) Running-time complexity: O(d 3 ) in general Quadratic convergence: If θ t 1 θ small enough, for some constant C, we have (C θ t θ ) = (C θ t 1 θ ) 2 See Boyd and Vandenberghe (2003) 88

89 Summary: minimizing smooth convex functions Assumption: g convex Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/t) convergence rate for smooth convex functions O(e tµ/l ) convergence rate for strongly smooth convex functions Optimal rates O(1/t 2 ) and O(e t µ/l ) Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate 89

90 Summary: minimizing smooth convex functions Assumption: g convex Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/t) convergence rate for smooth convex functions O(e tµ/l ) convergence rate for strongly smooth convex functions Optimal rates O(1/t 2 ) and O(e t µ/l ) Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate From smooth to non-smooth Subgradient method and ellipsoid 90

91 Counter-example (Bertsekas, 1999) Steepest descent for nonsmooth objectives g(θ 1,θ 2 ) = { 5(9θ θ 2 2) 1/2 if θ 1 > θ 2 (9θ θ 2 ) 1/2 if θ 1 θ 2 Steepest descent starting from any θ such that θ 1 > θ 2 > (9/16) 2 θ

92 Subgradient method/ descent (Shor et al., 1985) Assumptions g convex and B-Lipschitz-continuous on { θ 2 D} ( Algorithm: θ t = Π D θ t 1 2D ) B t g (θ t 1 ) Π D : orthogonal projection onto { θ 2 D} Constraints 92

93 Subgradient method/ descent (Shor et al., 1985) Assumptions g convex and B-Lipschitz-continuous on { θ 2 D} ( Algorithm: θ t = Π D θ t 1 2D ) B t g (θ t 1 ) Π D : orthogonal projection onto { θ 2 D} Bound: Three-line proof ( 1 g t t 1 k=0 θ k ) g(θ ) 2DB t Best possible convergence rate after O(d) iterations (Bubeck, 2015) 93

94 Subgradient method/ descent - proof - I Iteration: θ t = Π D (θ t 1 γ t g (θ t 1 )) with γ t = 2D B t Assumption: g (θ) 2 B and θ 2 D θ t θ 2 2 θ t 1 θ γ t g (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γ 2 t 2γ t (θ t 1 θ ) g (θ t 1 ) because g (θ t 1 ) 2 B θ t 1 θ 2 2+B 2 γ 2 t 2γ t [ g(θt 1 ) g(θ ) ] (property of subgradients) leading to g(θ t 1 ) g(θ ) B2 γ t [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 94

95 Subgradient method/ descent - proof - II Starting from g(θ t 1 ) g(θ ) B2 γ t 2 Constant step-size γ t = γ t [ g(θu 1 ) g(θ ) ] u=1 t u=1 B 2 γ 2 + t u=1 + 1 [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 1 [ θu 1 θ 2 2γ 2 θ u θ 2 2] t B2 γ γ θ 0 θ 2 2 t B2 γ γ D2 Optimized step-size γ t = 2D B t depends on horizon Leads to bound of 2DB t ( 1 t 1 ) Using convexity: g θ k g(θ ) 1 t t k=0 t 1 k=0 g(θ k ) g(θ ) 2DB t 95

96 Subgradient method/ descent - proof - III Starting from Decreasing step-size t [ g(θu 1 ) g(θ ) ] u=1 g(θ t 1 ) g(θ ) B2 γ t 2 = = t u=1 t u=1 B 2 γ u 2 t u=1 t u=1 B 2 γ u 2 B 2 γ u 2 B 2 γ u 2 + u=1 t u=1 + 1 [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 1 [ θu 1 θ 2 2γ 2 θ u θ 2 ] 2 u t 1 ( + θ u θ ) θ 0 θ θ t θ 2 2 2γ u+1 2γ u 2γ 1 2γ t + t 1 u=1 4D 2( 1 2γ u+1 1 2γ u ) + 4D 2 2γ 1 + 4D2 2γ t 3DB t with γ t = 2D B t Using convexity: g ( 1 t t 1 k=0 θ ) k g(θ ) 3DB t 96

97 Subgradient descent for machine learning Assumptions (f is the expected risk, ˆf the empirical risk) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. ˆf(θ) = 1 n n i=1 l(y i,φ(x i ) θ) G-Lipschitz loss: f and ˆf are GR-Lipschitz on Θ = { θ 2 D} Statistics: with probability greater than 1 δ sup ˆf(θ) f(θ) GRD [2+ 2log 2 θ Θ n δ ] Optimization: after t iterations of subgradient method ˆf(ˆθ) min η Θ ˆf(η) GRD t t = n iterations, with total running-time complexity of O(n 2 d) 97

98 Subgradient descent - strong convexity Assumptions g convex and B-Lipschitz-continuous on { θ 2 D} g µ-strongly convex ) 2 Algorithm: θ t = Π D (θ t 1 µ(t+1) g (θ t 1 ) Bound: ( 2 g t(t+1) t k=1 kθ k 1 ) g(θ ) 2B2 µ(t+1) Three-line proof Best possible convergence rate after O(d) iterations (Bubeck, 2015) 98

99 Subgradient method - strong convexity - proof - I Iteration: θ t = Π D (θ t 1 γ t g (θ t 1 )) with γ t = 2 µ(t+1) Assumption: g (θ) 2 B and θ 2 D and µ-strong convexity of f θ t θ 2 2 θ t 1 θ γ t g (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γt 2 2γ t (θ t 1 θ ) g (θ t 1 ) because g (θ t 1 ) 2 θ t 1 θ 2 2+B 2 γt 2 [ 2γ t g(θt 1 ) g(θ )+ µ 2 θ t 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to g(θ t 1 ) g(θ ) B2 γ t B 2 µ(t+1) + µ 2 [1 γ t µ ] θ t 1 θ γ t θ t θ 2 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ

100 Subgradient method - strong convexity - proof - II From g(θ t 1 ) g(θ ) B2 µ(t+1) + µ 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ t u [ g(θ u 1 ) g(θ ) ] u=1 u t=1 B 2 u µ(u+1) t [ u(u 1) θu 1 θ 2 2 u(u+1) θ u θ 2 2] u=1 B2 t µ + 1 [ 0 t(t+1) θt θ 2 B 4 2] 2 t µ ( 2 Using convexity: g t(t+1) t u=1 uθ u 1 ) g(θ ) 2B2 t+1 NB: with step-size γ n = 1/(nµ), extra logarithmic factor 100

101 Ellipsoid method Minimizing convex function g : R d R Builds a sequence of ellipsoids that contains the global minima. E 0 E 1 E 2 E 1 Represent E t = {θ R d,(θ θ t ) P 1 t (θ θ t ) 1} Fact 1: θ t+1 = θ t 1 d+1 P th t and P t+1 = d2 d 2 1 (P t 2 d+1 P th t h t P t ) 1 with h t = g (θ t ) P t g (x t ) g (θ t ) Fact 2: vol(e t ) vol(e t 1 )e 1/2d CV rate in O(e t/d2 ) 101

102 Summary: minimizing convex functions Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 g (θ t 1 ) 1 g (θ t 1 ) O ( e ρ2t ) convergence rate - Key insights from Bottou and Bousquet (2008) - In machine learning, no need to optimize below statistical error - In machine learning, cost functions are averages - Testing errors are more important than training errors Stochastic approximation 102

103 Summary: minimizing convex functions Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 g (θ t 1 ) 1 g (θ t 1 ) O ( e ρ2t ) convergence rate Key insights from Bottou and Bousquet (2008) 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages 3. Testing errors are more important than training errors Stochastic approximation 103

104 Problem parameters Summary of rates of convergence D diameter of the domain B Lipschitz-constant L smoothness constant µ strong convexity constant convex strongly convex nonsmooth deterministic: BD/ t deterministic: B 2 /(tµ) stochastic: BD/ n stochastic: B 2 /(nµ) smooth deterministic: LD 2 /t 2 deterministic: exp( t µ/l) stochastic: LD 2 / n stochastic: L/(nµ) finite sum: n/t finite sum: exp( min{1/n, µ/l} quadratic deterministic: LD 2 /t 2 deterministic: exp( t µ/l) stochastic: d/n+ld 2 /n stochastic: d/n+ld 2 /n 104

105 1. Introduction Outline - II Large-scale machine learning and optimization Classes of functions (convex, smooth, etc.) Traditional statistical analysis through Rademacher complexity 2. Classical methods for convex optimization Smooth optimization (gradient descent, Newton method) Non-smooth optimization (subgradient descent) Proximal methods 3. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 105

106 Outline - II 4. Classical stochastic approximation Asymptotic analysis Robbins-Monro algorithm Polyak-Rupert averaging 5. Smooth stochastic approximation algorithms Non-asymptotic analysis for smooth functions Logistic regression Least-squares regression without decaying step-sizes 6. Finite data sets Gradient methods with exponential convergence rates Convex duality (Dual) stochastic coordinate descent - Frank-Wolfe 106

107 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d 107

108 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d Machine learning - statistics loss for a single pair of observations: f n (θ) = l(y n,θ Φ(x n )) f(θ) = Ef n (θ) = El(y n,θ Φ(x n )) = generalization error Expected gradient: f (θ) = Ef n(θ) = E { l (y n,θ Φ(x n ))Φ(x n ) } Non-asymptotic results Number of iterations = number of observations 108

109 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d Stochastic approximation (much) broader applicability beyond convex optimization θ n = θ n 1 γ n h n (θ n 1 ) with E [ h n (θ n 1 ) θ n 1 ] = h(θn 1 ) Beyond convex problems, i.i.d assumption, finite dimension, etc. Typically asymptotic results (see next lecture) See, e.g., Kushner and Yin (2003); Benveniste et al. (2012) 109

110 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations 110

111 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results 111

112 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results Online learning Update ˆθ n after each new (potentially adversarial) observation z n Cumulative loss: 1 n n k=1 l(ˆθ k 1,z k ) Online to batch through averaging (Cesa-Bianchi et al., 2004) 112

113 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex 113

114 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α 114

115 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α Desirable practical behavior Applicable (at least) to classical supervised learning problems Robustness to (potentially unknown) constants (L,B,µ) Adaptivity to difficulty of the problem (e.g., strong convexity) 115

116 Stochastic subgradient descent /method Assumptions f n convex and B-Lipschitz-continuous on { θ 2 D} (f n ) i.i.d. functions such that Ef n = f θ global optimum of f on C = { θ 2 D} ( Algorithm: θ n = Π D θ n 1 2D ) B n f n(θ n 1 ) 116

117 Stochastic subgradient descent /method Assumptions f n convex and B-Lipschitz-continuous on { θ 2 D} (f n ) i.i.d. functions such that Ef n = f θ global optimum of f on C = { θ 2 D} ( Algorithm: θ n = Π D θ n 1 2D ) B n f n(θ n 1 ) Bound: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Same three-line proof as in the deterministic case Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Running-time complexity: O(dn) after n iterations 117

118 Stochastic subgradient method - proof - I Iteration: θ n = Π D (θ n 1 γ n f n(θ n 1 )) with γ n = 2D B n F n : information up to time n f n(θ) 2 B and θ 2 D, unbiased gradients/functions E(f n F n 1 ) = f θ n θ 2 2 θ n 1 θ γ n f n(θ n 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2 +B 2 γ 2 n 2γ n (θ n 1 θ ) f n(θ n 1 ) because f n(θ n 1 ) 2 E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ ) f (θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] (subgradient proper E θ n θ 2 2 E θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ Ef(θn 1 ) f(θ ) ] leading to Ef(θ n 1 ) f(θ ) B2 γ n [ ] E θn 1 θ 2 2γ 2 E θ n θ 2 2 n 118

119 Stochastic subgradient method - proof - II Starting from Ef(θ n 1 ) f(θ ) B2 γ n [ ] E θn 1 θ 2 2 E θ n θ 2 2 2γ n n [ Ef(θu 1 ) f(θ ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 1 [ E θu 1 θ 2 2γ 2 E θ u θ 2 ] 2 u + 4D2 2γ n 2DB n with γ n = 2D B n Using convexity: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n 119

120 Stochastic subgradient method Extension to online learning Assume different and arbitrary functions f n : R d R Observations of f n(θ n 1 )+ε n with E(ε n F n 1 ) = 0 and f n(θ n 1 )+ε n B almost surely Performance criterion: (normalized) regret 1 n n i=1 f i (θ i 1 ) inf θ 2 D 1 n n i=1 f i (θ) Warning: often not normalized May not be non-negative (typically is) 120

121 Stochastic subgradient method - online learning - I Iteration: θ n = Π D (θ n 1 γ n (f n(θ n 1 )+ε n )) with γ n = 2D B n F n : information up to time n - θ an arbitrary point such that θ D f n(θ n 1 )+ε n 2 B and θ 2 D, unbiased gradients E(ε n F n 1 ) = 0 θ n θ 2 2 θ n 1 θ γ n (f n(θ n 1 )+ε n ) 2 2 by contractivity of projections θ n 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ) (f n(θ n 1 )+ε n ) because f n(θ n 1 )+ E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ) f n(θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ fn (θ n 1 ) f n (θ) ] (subgradient propert E θ n θ 2 2 E θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ Efn (θ n 1 ) f n (θ) ] leading to Ef n (θ n 1 ) f n (θ) B2 γ n [ ] E θn 1 θ 2 2γ 2 E θ n θ 2 2 n 121

122 Stochastic subgradient method - online learning - II Starting from Ef n (θ n 1 ) f n (θ) B2 γ n [ E θn 1 θ 2 2γ 2 E θ n θ 2 ] 2 n n [ Efu (θ u 1 ) f u (θ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 1 [ E θu 1 θ 2 2γ 2 E θ u θ 2 ] 2 u + 4D2 2γ n 2DB n with γ n = 2D B n For any θ such that θ D: 2DB n 1 n n k=1 Ef k (θ k 1 ) 1 n n k=1 f k (θ) Online to batch conversion: assuming convexity 122

123 Stochastic subgradient descent - strong convexity - I Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f f µ-strongly convex on { θ 2 D} θ global optimum of f over { θ 2 D} ) 2 Algorithm: θ n = Π D (θ n 1 µ(n+1) f n(θ n 1 ) Bound: ( 2 Ef n(n+1) n ) kθ k 1 k=1 f(θ ) 2B2 µ(n+1) Same proof than deterministic case (Lacoste-Julien et al., 2012) Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) 123

124 Stochastic subgradient - strong convexity - proof - I Iteration: θ n = Π D (θ n 1 γ n f n(θ t 1 )) with γ n = 2 µ(n+1) Assumption: f n(θ) 2 B and θ 2 D and µ-strong convexity of f θ n θ 2 2 θ n 1 θ γ n f n(θ t 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2+B 2 γn 2γ 2 n (θ n 1 θ ) f n(θ t 1 ) because f n(θ t 1 ) 2 E( F n 1 ) θ n 1 θ 2 2+B 2 γn 2γ 2 [ n f(θn 1 ) f(θ )+ µ 2 θ n 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to Ef(θ n 1 ) f(θ ) B2 γ n B 2 µ(n+1) + µ 2 [ 1 γ n µ ] θ n 1 θ γ n θ n θ 2 2 [n 1 2 ] θn 1 θ 2 2 µ(n+1) θ n θ

125 Stochastic subgradient - strong convexity - proof - II FromEf(θ n 1 ) f(θ ) B2 µ(n+1) +µ 2 [n 1 2 ] E θn 1 θ 2 2 µ(n+1) E θ n θ n u [ Ef(θ u 1 ) f(θ ) ] u=1 n u=1 B 2 u µ(u+1) n [ u(u 1)E θu 1 θ 2 2 u(u+1)e θ u u=1 B2 n µ + 1 [ 0 n(n+1)e θn θ 2 B 4 2] 2 n µ ( 2 Using convexity: Ef n(n+1) n u=1 uθ u 1 ) g(θ ) 2B2 n+1 NB: with step-size γ n = 1/(nµ), extra logarithmic factor (see later) 125

126 Stochastic subgradient descent - strong convexity - II Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f θ global optimum of g = f + µ No compactness assumption - no projections Algorithm: 2 2 [ ] θ n = θ n 1 µ(n+1) g n(θ n 1 ) = θ n 1 f µ(n+1) n (θ n 1 )+µθ n 1 ( 2 Bound: Eg n(n+1) n k=1 Minimax convergence rate kθ k 1 ) g(θ ) 2B2 µ(n+1) 126

127 Strong convexity - proof with logn factor - I Iteration: θ n = Π D (θ n 1 γ n f n(θ t 1 )) with γ n = 1 µn Assumption: f n(θ) 2 B and θ 2 D and µ-strong convexity of f θ n θ 2 2 θ n 1 θ γ n f n(θ t 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2+B 2 γn 2γ 2 n (θ n 1 θ ) f n(θ t 1 ) because f n(θ t 1 ) 2 E( F n 1 ) θ n 1 θ 2 2+B 2 γn 2γ 2 [ n f(θn 1 ) f(θ )+ µ 2 θ n 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to Ef(θ n 1 ) f(θ ) B2 γ n [ 1 γ n µ ] θ n 1 θ γ n θ n θ 2 2 B2 2µn + µ [ ] n 1 θn 1 θ nµ 2 θ n θ

128 Strong convexity - proof with logn factor - II From Ef(θ n 1 ) f(θ ) B2 2µn + µ [ ] n 1 θn 1 θ 2 2 nµ 2 2 θ n θ 2 2 n [ Ef(θu 1 ) f(θ ) ] u=1 n u=1 B 2 2µu B2 logn 2µ n [ (u 1)E θu 1 θ 2 2 ue θ u θ 2 2] u= [ 0 ne θn θ 2 2 ] B 2 logn 2µ Using convexity: Ef ( 1 n n u=1 Why could this be useful? θ u 1 ) f(θ ) B2 logn 2µ n 128

129 Stochastic subgradient descent - strong convexity Online learning Need logn term for uniform averaging. For all θ: 1 n n i=1 f i (θ i 1 ) 1 n n i=1 f i (θ) B2 logn 2µ n Optimal. See Hazan and Kale (2014). 129

130 Typical result: Ef Beyond convergence in expectation ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Obtained with simple conditioning arguments High-probability bounds ( Markov inequality: P f ( 1 n ) n 1 k=0 θ k ) f(θ ) ε 2DB nε 130

131 Typical result: Ef Beyond convergence in expectation ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Obtained with simple conditioning arguments High-probability bounds ( Markov inequality: P f ( 1 n ) n 1 k=0 θ k ) f(θ ) ε 2DB nε Deviation inequality (Nemirovski et al., 2009; Nesterov and Vial, 2008) ( ( n 1 1 P f n k=0 θ k ) f(θ ) 2DB ) (2+4t) n 2exp( t 2 ) See also Bach (2013) for logistic regression 131

132 Stochastic subgradient method - high probability - I Iteration: θ n = Π D (θ n 1 γ n f n(θ n 1 )) with γ n = 2D B n F n : information up to time n f n(θ) 2 B and θ 2 D, unbiased gradients/functions E(f n F n 1 ) = f θ n θ 2 2 θ n 1 θ γ n f n(θ n 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2 +B 2 γ 2 n 2γ n (θ n 1 θ ) f n(θ n 1 ) because f n(θ n 1 ) 2 E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ ) f (θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] (subgradient proper Without expectations and with Z n = 2γ n (θ n 1 θ ) [f n(θ n 1 ) f (θ n 1 )] θ n θ 2 2 θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] +Z n 132

133 Stochastic subgradient method - high probability - II Without expectations and with Z n = 2γ n (θ n 1 θ ) [f n(θ n 1 ) f (θ n 1 )] θ n θ 2 2 θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] +Z n f(θ n 1 ) f(θ ) 1 [ θn 1 θ 2 2γ 2 θ n θ 2 ] B 2 γ n 2 + n 2 + Z n 2γ n n [ f(θu 1 ) f(θ ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 + 4D2 2γ n + 1 [ θu 1 θ 2 2γ 2 θ u θ 2 ] n 2 + u u=1 n n u=1 Z u 2DB + 2γ u n u=1 Z u 2γ u Z u with γ n = 2D 2γ u B n Need to study n u=1 Z u 2γ u with E(Z n F n 1 ) = 0 and Z n 8γ n DB 133

134 Stochastic subgradient method - high probability - III Need to study n u=1 Z u 2γ u with E( Z n 2γ n F n 1 ) = 0 and Z n 4DB Azuma-Hoeffding inequality for bounded martingale increments: ( n P u=1 Z u t ) n 4DB 2γ u exp ( t2 2 ) Moments with Burkholder-Rosenthal-Pinelis inequality(pinelis, 1994) 134

135 Beyond stochastic gradient method Adding a proximal step Goal: min θ R df(θ)+ω(θ) = Ef n(θ)+ω(θ) Replace recursion θ n = θ n 1 γ n f n(θ n ) by θ n = min θ R d θ θ n 1 +γ n f n(θ n ) 2 2 +CΩ(θ) Xiao (2010); Hu et al. (2009) May be accelerated (Ghadimi and Lan, 2013) Related frameworks Regularized dual averaging (Nesterov, 2009; Xiao, 2010) Mirror descent (Nemirovski et al., 2009; Lan et al., 2012) 135

Large-scale machine learning and convex optimization

Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE IFCAM, Bangalore - July 2014 Big data revolution? A new scientific