Statistical machine learning and convex optimization
|
|
- Chad Horn
- 6 years ago
- Views:
Transcription
1 Statistical machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Mastère M2 - Paris-Sud - Spring 2018 Slides available: 1
2 Statistical machine learning and convex optimization Six classes (lecture notes and slides online), always in 0A2 1. Thursday February 01, 2pm to 5pm 2. Thursday February 08, 2pm to 5pm 3. Thursday February 15, 2pm to 5pm 4. Thursday March 22, 2pm to 5pm 5. Thursday March 29, 2pm to 5pm 6. Thursday April 05, 2pm to 5pm Evaluation 1. Studying research papers with three-page report (see instructions on web site - due April 19) 2. Basic implementations (Matlab / Python / R) 3. Attending 4 out of 6 classes is mandatory Register online 2
3 Big data revolution? A new scientific context Data everywhere: size does not (always) matter Science and industry Size and variety Learning from examples n observations in dimension d 3
4 Search engines - Advertising 4
5 Search engines - Advertising 5
6 Advertising 6
7 Marketing - Personalized recommendation 7
8 Visual object recognition 8
9 Bioinformatics Protein: Crucial elements of cell life Massive data: 2 millions for humans Complex data 9
10 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951b) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent 10
11 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951b) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent 11
12 Context Machine learning for big data Large-scale machine learning: large d, large n d : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(dn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951b) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent 12
13 Scaling to large problems Retour aux sources 1950 s: Computers not powerful enough IBM 1620, 1959 CPU frequency: 50 KHz Price > dollars 2010 s: Data too massive - Stochastic gradient methods (Robbins and Monro, 1951a) - Going back to simple methods 13
14 Scaling to large problems Retour aux sources 1950 s: Computers not powerful enough IBM 1620, 1959 CPU frequency: 50 KHz Price > dollars 2010 s: Data too massive Stochastic gradient methods (Robbins and Monro, 1951a) Going back to simple methods 14
15 1. Introduction Outline - II Large-scale machine learning and optimization Classes of functions (convex, smooth, etc.) Traditional statistical analysis through Rademacher complexity 2. Classical methods for convex optimization Smooth optimization (gradient descent, Newton method) Non-smooth optimization (subgradient descent) Proximal methods 3. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 15
16 Outline - II 4. Classical stochastic approximation Asymptotic analysis Robbins-Monro algorithm Polyak-Rupert averaging 5. Smooth stochastic approximation algorithms Non-asymptotic analysis for smooth functions Logistic regression Least-squares regression without decaying step-sizes 6. Finite data sets Gradient methods with exponential convergence rates Convex duality (Dual) stochastic coordinate descent - Frank-Wolfe 16
17 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer 17
18 Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2 18
19 Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2 Classification : y { 1,1}, prediction ŷ = sign(θ Φ(x)) loss of the form l(yθ Φ(x)) True 0-1 loss: l(yθ Φ(x)) = 1 yθ Φ(x)<0 Usual convex losses: hinge square logistic
20 Main motivating examples Support vector machine (hinge loss): non-smooth l(y,θ Φ(X)) = max{1 Yθ Φ(X),0} Logistic regression: smooth Least-squares regression l(y,θ Φ(X)) = log(1+exp( Yθ Φ(X))) l(y,θ Φ(X)) = 1 2 (Y θ Φ(X)) 2 Structured output regression See Tsochantaridis et al. (2005); Lacoste-Julien et al. (2013) 20
21 Usual regularizers Main goal: avoid overfitting (squared) Euclidean norm: θ 2 2 = d j=1 θ j 2 Numerically well-behaved Representer theorem and kernel methods : θ = n i=1 α iφ(x i ) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) 21
22 Usual regularizers Main goal: avoid overfitting (squared) Euclidean norm: θ 2 2 = d j=1 θ j 2 Numerically well-behaved Representer theorem and kernel methods : θ = n i=1 α iφ(x i ) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) Sparsity-inducing norms Main example: l 1 -norm θ 1 = d j=1 θ j Perform model selection as well as regularization Non-smooth optimization and structured sparsity See, e.g., Bach, Jenatton, Mairal, and Obozinski (2012b,a) 22
23 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer 23
24 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously 24
25 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously 25
26 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) such that Ω(θ) D θ R d n i=1 convex data fitting term + constraint Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously 26
27 General assumptions Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Bounded features Φ(x) R d : Φ(x) 2 R Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Loss for a single observation: f i (θ) = l(y i,θ Φ(x i )) i, f(θ) = Ef i (θ) Properties of f i,f, ˆf Convex on R d Additional regularity assumptions: Lipschitz-continuity, smoothness and strong convexity 27
28 Convexity Global definitions g(θ) θ 28
29 Convexity Global definitions (full domain) g(θ) g(θ 2 ) g(θ 1 ) Not assuming differentiability: θ 2 θ 1 θ θ 1,θ 2,α [0,1], g(αθ 1 +(1 α)θ 2 ) αg(θ 1 )+(1 α)g(θ 2 ) 29
30 Convexity Global definitions (full domain) g(θ) g(θ 2 ) g(θ 1 ) Assuming differentiability: θ 2 θ 1 θ θ 1,θ 2, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 ) Extensions to all functions with subgradients / subdifferential 30
31 Subgradients and subdifferentials Given g : R d R convex g(θ ) g(θ) θ g(θ)+s (θ θ) s R d is a subgradient of g at θ if and only if θ R d,g(θ ) g(θ)+s (θ θ) Subdifferential g(θ) = set of all subgradients at θ If g is differentiable at θ, then g(θ) = {g (θ)} Example: absolute value The subdifferential is never empty! See Rockafellar (1997) θ 31
32 Convexity Global definitions (full domain) g(θ) θ Local definitions Twice differentiable functions θ, g (θ) 0 (positive semi-definite Hessians) 32
33 Convexity Global definitions (full domain) g(θ) θ Local definitions Twice differentiable functions θ, g (θ) 0 (positive semi-definite Hessians) Why convexity? 33
34 Why convexity? Local minimum = global minimum Optimality condition (non-smooth): 0 g(θ) Optimality condition (smooth): g (θ) = 0 Convex duality See Boyd and Vandenberghe (2003) Recognizing convex problems See Boyd and Vandenberghe (2003) 34
35 Lipschitz continuity Bounded gradients of g ( Lipschitz-continuity): the function g if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: θ R d, θ 2 D g (θ) 2 B θ,θ R d, θ 2, θ 2 D g(θ) g(θ ) B θ θ 2 Machine learning with g(θ) = 1 n n i=1 l(y i,θ Φ(x i )) G-Lipschitz loss and R-bounded data: B = GR 35
36 Smoothness and strong convexity A function g : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, g (θ 1 ) g (θ 2 ) 2 L θ 1 θ 2 2 If g is twice differentiable: θ R d, g (θ) L Id smooth non smooth 36
37 Smoothness and strong convexity A function g : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, g (θ 1 ) g (θ 2 ) 2 L θ 1 θ 2 2 If g is twice differentiable: θ R d, g (θ) L Id Machine learning with g(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) L loss -smooth loss and R-bounded data: L = L loss R 2 37
38 Smoothness and strong convexity A function g : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If g is twice differentiable: θ R d, g (θ) µ Id convex strongly convex 38
39 Smoothness and strong convexity A function g : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If g is twice differentiable: θ R d, g (θ) µ Id (large µ/l) (small µ/l) 39
40 Smoothness and strong convexity A function g : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If g is twice differentiable: θ R d, g (θ) µ Id Machine learning with g(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension) 40
41 Smoothness and strong convexity A function g : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ If g is twice differentiable: θ R d, g (θ) µ Id Machine learning with g(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension) Adding regularization by µ 2 θ 2 creates additional bias unless µ is small 41
42 Summary of smoothness/convexity assumptions Bounded gradients of g (Lipschitz-continuity): the function g if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: θ R d, θ 2 D g (θ) 2 B Smoothness of g: the function g is convex, differentiable with L-Lipschitz-continuous gradient g (e.g., bounded Hessians): θ 1,θ 2 R d, g (θ 1 ) g (θ 2 ) 2 L θ 1 θ 2 2 Strong convexity of g: The function g is strongly convex with respect to the norm, with convexity constant µ > 0: θ 1,θ 2 R d, g(θ 1 ) g(θ 2 )+g (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ
43 Analysis of empirical risk minimization Approximation and estimation errors: Θ = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ Θ θ Θ θ R df(θ) Estimation error Approximation error NB: may replace min by best (non-linear) predictions θ Rdf(θ) 43
44 Analysis of empirical risk minimization Approximation and estimation errors: Θ = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ Θ θ Θ θ R df(θ) Estimation error Approximation error 1. Uniform deviation bounds, with ˆθ argmin θ Θ ˆf(θ) f(ˆθ) min θ Θ f(θ) = [ f(ˆθ) ˆf(ˆθ) ] + [ˆf(ˆθ) ˆf((θ ) Θ ) ] + [ˆf((θ ) Θ ) f((θ supf(θ) 0 +sup θ Θ θ Θ ˆf(θ) f(θ) 44
45 Analysis of empirical risk minimization Approximation and estimation errors: Θ = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ Θ θ Θ θ R df(θ) Estimation error Approximation error 1. Uniform deviation bounds, with ˆθ argmin θ Θ ˆf(θ) f(ˆθ) min θ Θ Typically slow rate O ( 1/ n ) f(θ) supf(θ) ˆf(θ)+sup θ Θ θ Θ ˆf(θ) f(θ) 2. More refined concentration results with faster rates O(1/n) 45
46 Analysis of empirical risk minimization Approximation and estimation errors: Θ = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ Θ θ Θ θ R df(θ) Estimation error Approximation error 1. Uniform deviation bounds, with ˆθ argmin θ Θ ˆf(θ) f(ˆθ) min θ Θ Typically slow rate O ( 1/ n ) f(θ) 2 sup f(θ) ˆf(θ) θ Θ 2. More refined concentration results with faster rates O(1/n) 46
47 Motivation from least-squares For least-squares, we have l(y,θ Φ(x)) = 1 2 (y θ Φ(x)) 2, and sup θ 2 D ˆf(θ) f(θ) = 1 ( 1 n 2 θ Φ(x i )Φ(x i ) EΦ(X)Φ(X) )θ n i=1 ( 1 n ) θ y i Φ(x i ) EYΦ(X) + 1 ( 1 n yi 2 EY ), 2 n 2 n i=1 i=1 1 n i )Φ(x i ) n i=1φ(x EΦ(X)Φ(X) op +D 1 n y i Φ(x i ) EYΦ(X) n n 2 2 yi 2 EY 2, n f(θ) ˆf(θ) D2 2 i=1 sup f(θ) ˆf(θ) O(1/ n) with high probability from 3 concentrations θ 2 D i=1 47
48 Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on Θ = { θ 2 D} No assumptions regarding convexity 48
49 Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on Θ = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup ˆf(θ) f(θ) l 0+GRD [2+ θ Θ n 2log 2 δ Expectated estimation error: E [ sup ˆf(θ) f(θ) ] 4l 0+4GRD θ Θ n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate ] 49
50 Symmetrization with Rademacher variables Let D = {x 1,y 1,...,x n,y n} an independent copy of the data D = {x 1,y 1,...,x n,y n }, with corresponding loss functions f i (θ) E [ supf(θ) ˆf(θ) ] = E [ sup θ Θ θ Θ [ = E [ E E sup θ Θ [ [ = E sup θ Θ [ = E sup [ 2E θ Θ ( f(θ) 1 n n 1 n sup θ Θ 1 n 1 n sup θ Θ 1 n i=1 n i=1 ) ] f i (θ) E ( f i(θ) f i (θ) D ) ( f i (θ) f i (θ) ) ]] D 1 n n i=1 n ( f i (θ) f i (θ) )] i=1 n ( ε i f i (θ) f i (θ) )] with ε i uniform in { 1,1} i=1 n i=1 ] ε i f i (θ) = Rademacher complexity 50
51 Rademacher complexity Rademacher complexity of the class of functions (X,Y) l(y,θ Φ(X)) [ n ] R n = E ε i f i (θ) sup θ Θ 1 n i=1 with f i (θ) = l(x i,θ Φ(x i )), (x i,y i ), i.i.d NB 1: two expectations, with respect to D and with respect to ε Empirical Rademacher average ˆR n by conditioning on D 1 n NB 2: sometimes defined as sup θ Θ n i=1 ε if i (θ) Main property: E [ supf(θ) ˆf(θ) ] and E [ sup θ Θ θ Θ ˆf(θ) f(θ) ] 2R n 51
52 From Rademacher complexity to uniform bound Let Z = sup θ Θ f(θ) ˆf(θ) By changing the pair (x i,y i ), Z may only change by 2 n sup l(y,θ Φ(X)) 2 ( ) 2( sup l(y,0) +GRD l0 +GRD ) = c n n with sup l(y,0) = l 0 MacDiarmid inequality: with probability greater than 1 δ, Z EZ + n 2 c log 1 δ 2R n+ 2 n ( l0 +GRD ) log 1 δ 52
53 Bounding the Rademacher average - I Wehave, withϕ i (u) = l(y i,u) l(y i,0)isalmostsurelyg-lipschitz: ˆR n = E ε [sup θ Θ E ε [sup θ Θ 1 n 1 n n i=1 n i=1 0+E ε [sup θ Θ = 0+E ε [sup θ Θ 1 n 1 n ] ε i f i (θ) [ ε i f i (0) ]+E ε sup θ Θ 1 n n [ ε i fi (θ) f i (0) ]] i=1 n i=1 ] ε i ϕ i (θ Φ(x i )) n [ ε i fi (θ) f i (0) ]] Using Ledoux-Talagrand contraction results for Rademacher averages (since ϕ i is G-Lipschitz), we get (Meir and Zhang, 2003): ˆR n 2G E ε [ sup θ 2 D 1 n n i=1 i=1 ] ε i θ Φ(x i ) 53
54 Proof of Ledoux-Talagrand lemma (Meir and Zhang, 2003, Lemma 5) Given any b, a i : Θ R (no assumption) and ϕ i : R R any 1-Lipschitz-functions, i = 1,...,n E ε [supb(θ)+ θ Θ n i=1 Proof by induction on n n = 0: trivial From n to n+1: see next slide ] [ ε i ϕ i (a i (θ)) E ε supb(θ)+ θ Θ n i=1 ] ε i a i (θ) 54
55 From n to n+1 E ε1,...,ε n+1 [sup θ Θ = E ε1,...,εn = E ε1,...,εn E ε1,...,εn [ [ [ sup θ,θ Θ sup θ,θ Θ sup θ,θ Θ = E ε1,...,εne εn+1 [sup θ Θ E ε1,...,εn,ε n+1 [sup θ Θ b(θ) + n+1 i=1 b(θ) + b(θ ) 2 b(θ) + b(θ ) 2 b(θ) + b(θ ) 2 ] ε i ϕ i (a i (θ)) n i=1 n i=1 n i=1 b(θ) + ε n+1 a n+1 (θ) + b(θ) + ε n+1 a n+1 (θ) + ε i ϕ i (a i (θ)) + ϕ i (a i (θ )) 2 ε i ϕ i (a i (θ)) + ϕ i (a i (θ )) 2 ε i ϕ i (a i (θ)) + ϕ i (a i (θ )) 2 n i=1 ] ε i ϕ i (a i (θ)) n ] ε i a i (θ) by recursion i=1 + ϕ n+1(a n+1 (θ)) ϕ n+1 (a n+1 (θ 2 + ϕ n+1(a n+1 (θ)) ϕ n+1 (a n+1 (θ 2 + a n+1(θ) a n+1 (θ ] ) 2 55
56 We have: Bounding the Rademacher average - II [ R n 2GE = 2GE D1 n 2GD sup θ 2 D n i=1 E 1 n 1 n n i=1 ε i Φ(x i ) ] ε i θ Φ(x i ) 2 n ε i Φ(x i ) i=1 2 2 by Jensen s inequality 2GRD n by using Φ(x) 2 R and independence Overall, we get, with probability 1 δ: sup f(θ) ˆf(θ) 1 ( l0 +GRD)(4+ θ Θ n 2log 1 δ ) 56
57 We have, with probability 1 δ Putting it all together For exact minimizer ˆθ argmin θ Θ ˆf(θ), we have f(ˆθ) min θ Θ f(θ) sup θ Θ For inexact minimizer η Θ f(η) min θ Θ ˆf(θ) f(θ)+sup θ Θ 2 n ( l0 +GRD)(4+ f(θ) ˆf(θ) 2log 1 δ f(θ) 2 sup ˆf(θ) f(θ) + [ˆf(η) ] ˆf(ˆθ) θ Θ Only need to optimize with precision 2 n (l 0 +GRD) ) 57
58 We have, with probability 1 δ Putting it all together For exact minimizer ˆθ argmin θ Θ ˆf(θ), we have f(ˆθ) min θ Θ For inexact minimizer η Θ f(η) min θ Θ f(θ) 2 sup ˆf(θ) f(θ) θ Θ 2 n ( l0 +GRD)(4+ 2log 1 δ f(θ) 2 sup ˆf(θ) f(θ) + [ˆf(η) ] ˆf(ˆθ) θ Θ Only need to optimize with precision 2 n (l 0 +GRD) ) 58
59 Slow rate for supervised learning (summary) Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on Θ = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup ˆf(θ) f(θ) (l 0+GRD) [2+ θ Θ n 2log 2 δ Expectated estimation error: E [ sup ˆf(θ) f(θ) ] 4(l 0+GRD) θ Θ n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate ] 59
60 Motivation from mean estimation Estimator ˆθ = 1 n n i=1 z i = argmin θ R 1 2n n i=1 (θ z i) 2 = ˆf(θ) From before: f(θ) = 1 2 E(θ z)2 = 1 2 (θ Ez) var(z) = ˆf(θ)+O(1/ n) f(ˆθ) = 1 2 (ˆθ Ez) var(z) = f(ez)+o(1/ n) 60
61 Motivation from mean estimation Estimator ˆθ = 1 n n i=1 z i = argmin θ R 1 2n n i=1 (θ z i) 2 = ˆf(θ) θ = Ez = argmin θ R 1 2 E(θ z)2 = f(θ) From before (estimation error): f(ˆθ) f(θ ) = O(1/ n) Direct computation: f(θ) = 1 2 E(θ z)2 = 1 2 (θ Ez) var(z) More refined/direct bound: f(ˆθ) f(ez) = 1 2 (ˆθ Ez) 2 E [ f(ˆθ) f(ez) ] = 1 2 E ( 1 n n i=1 ) 2 z i Ez = 1 2n var(z) Bound only at ˆθ + strong convexity (instead of uniform bound) 61
62 Fast rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Same as before (bounded features, Lipschitz loss) Regularized risks: f µ (θ) = f(θ)+ µ 2 θ 2 2 and ˆf µ (θ) = ˆf(θ)+ µ 2 θ 2 2 Convexity For any a > 0, with probability greater than 1 δ, for all θ R d, f µ (ˆθ) min (η) 8G2 R 2 (32+log 1 δ ) η R dfµ µn Results from Sridharan, Srebro, and Shalev-Shwartz (2008) see also Boucheron and Massart (2011) and references therein Strongly convex functions fast rate Warning: µ should decrease with n to reduce approximation error 62
63 1. Introduction Outline - II Large-scale machine learning and optimization Classes of functions (convex, smooth, etc.) Traditional statistical analysis through Rademacher complexity 2. Classical methods for convex optimization Smooth optimization (gradient descent, Newton method) Non-smooth optimization (subgradient descent) Proximal methods 3. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 63
64 Outline - II 4. Classical stochastic approximation Asymptotic analysis Robbins-Monro algorithm Polyak-Rupert averaging 5. Smooth stochastic approximation algorithms Non-asymptotic analysis for smooth functions Logistic regression Least-squares regression without decaying step-sizes 6. Finite data sets Gradient methods with exponential convergence rates Convex duality (Dual) stochastic coordinate descent - Frank-Wolfe 64
65 Complexity results in convex optimization Assumption: g convex on R d Classical generic algorithms Gradient descent and accelerated gradient descent Newton method Subgradient method and ellipsoid algorithm 65
66 Complexity results in convex optimization Assumption: g convex on R d Classical generic algorithms Gradient descent and accelerated gradient descent Newton method Subgradient method and ellipsoid algorithm Key additional properties of g Lipschitz continuity, smoothness or strong convexity Key insight from Bottou and Bousquet (2008) In machine learning, no need to optimize below estimation error Key references: Nesterov (2004), Bubeck (2015) 66
67 Several criteria for characterizing convergence Objective function values Usually weaker condition Iterates g(θ) inf η R dg(η) inf η argming θ η 2 Typically used for strongly-convex problems NB 1: relationships between the two types in several situations (see later) NB 2: similarity with prediction vs. estimation in statistics 67
68 (smooth) gradient descent Assumptions g convex with L-Lipschitz-continuous gradient (e.g., L-smooth) Minimum attained at θ Algorithm: θ t = θ t 1 1 L g (θ t 1 ) 68
69 (smooth) gradient descent - strong convexity Assumptions g convex with L-Lipschitz-continuous gradient (e.g., L-smooth) g µ-strongly convex Algorithm: Bound: θ t = θ t 1 1 L g (θ t 1 ) g(θ t ) g(θ ) (1 µ/l) t[ g(θ 0 ) g(θ ) ] Three-line proof Line search, steepest descent or constant step-size 69
70 Assumptions (smooth) gradient descent - slow rate g convex with L-Lipschitz-continuous gradient (e.g., L-smooth) Minimum attained at θ Algorithm: Bound: Four-line proof θ t = θ t 1 1 L g (θ t 1 ) g(θ t ) g(θ ) 2L θ 0 θ 2 t+4 Adaptivity of gradient descent to problem difficulty Not best possible convergence rates after O(d) iterations 70
71 Gradient descent - Proof for quadratic functions Quadratic convex function: g(θ) = 1 2 θ Hθ c θ µ and L are smallest largest eigenvalues of H Global optimum θ = H 1 c (or H c) Gradient descent: θ t = θ t 1 1 L (Hθ t 1 c) = θ t 1 1 L (Hθ t 1 Hθ ) θ t θ = (I 1 L H)(θ t 1 θ ) = (I 1 L H)t (θ 0 θ ) Strong convexity µ > 0: eigenvalues of (I 1 L H)t in [0,(1 µ L )t ] Convergence of iterates: θ t θ 2 (1 µ/l) 2t θ 0 θ 2 Function values: g(θ t ) g(θ ) (1 µ/l) 2t[ g(θ 0 ) g(θ ) ] Function values: g(θ t ) g(θ ) L t θ 0 θ 2 71
72 Gradient descent - Proof for quadratic functions Quadratic convex function: g(θ) = 1 2 θ Hθ c θ µ and L are smallest largest eigenvalues of H Global optimum θ = H 1 c (or H c) Gradient descent: θ t = θ t 1 1 L (Hθ t 1 c) = θ t 1 1 L (Hθ t 1 Hθ ) θ t θ = (I 1 L H)(θ t 1 θ ) = (I 1 L H)t (θ 0 θ ) Convexity µ = 0: eigenvalues of (I 1 L H)t in [0,1] No convergence of iterates: θ t θ 2 θ 0 θ 2 Function values: g(θ t ) g(θ ) max v [0,L] v(1 v/l) 2t θ 0 θ 2 Function values: g(θ t ) g(θ ) L t θ 0 θ 2 72
73 Properties of smooth convex functions Let g : R d R a convex L-smooth function. Then for all θ,η R d : Definition: g (θ) g (η) L θ η If twice differentiable: 0 g (θ) LI Quadratic upper-bound: 0 g(θ) g(η) g (η) (θ η) L 2 θ η 2 Taylor expansion with integral remainder Co-coercivity: 1 L g (θ) g (η) 2 [ g (θ) g (η) ] (θ η) If g is µ-strongly convex (no need for smoothness), then g(θ) g(η)+g (η) (θ η)+ 1 2µ g (θ) g (η) 2 Distance to optimum: g(θ) g(θ ) g (θ) (θ θ ) 73
74 Proof of co-coercivity Quadratic upper-bound: 0 g(θ) g(η) g (η) (θ η) L 2 θ η 2 Taylor expansion with integral remainder Lower bound: g(θ) g(η)+g (η) (θ η)+ 1 2L g (θ) g (η) 2 Define h(θ) = g(θ) θ g (η), convex with global minimum at η h(η) h(θ 1 L h (θ)) h(θ)+h (θ) ( 1 L h (θ)))+ L 2 1 L h (θ)) 2, which is thus less than h(θ) 1 2L h (θ) 2 Thus g(η) η g (η) g(θ) θ g (η) 1 2L g (θ) g (η) 2 Proof of co-coercivity Apply lower bound twice for (η,θ) and (θ,η), and sum to get 0 [g (η) g (θ)] (θ η)+ 1 L g (θ) g (η) 2 NB: simple proofs with second-order derivatives 74
75 Proof of g(θ) g(η)+g (η) (θ η)+ 1 2µ g (θ) g (η) 2 Define h(θ) = g(θ) θ g (η), convex with global minimum at η h(η) = min θ h(θ) min ζ h(θ)+h (θ) (ζ θ)+ µ 2 ζ θ 2, which is attained for ζ θ = 1 µ h (θ) This leads to h(η) h(θ) 1 2µ h (θ) 2 Hence, g(η) η g (η) g(θ) θ g (η) 1 2µ g (η) (θ) 2 NB: no need for smooothness NB: simple proofs with second-order derivatives With η = θ global minimizer, another distance to optimum g(θ) g(θ ) 1 2µ g (θ) 2 Polyak-Lojasiewicz 75
76 Convergence proof - gradient descent smooth strongly convex functions Iteration: θ t = θ t 1 γg (θ t 1 ) with γ = 1/L g(θ t ) = g [ θ t 1 γg (θ t 1 ) ] g(θ t 1 )+g (θ t 1 ) [ γg (θ t 1 ) ] + L 2 γg (θ t 1 ) 2 = g(θ t 1 ) γ(1 γl/2) g (θ t 1 ) 2 = g(θ t 1 ) 1 2L g (θ t 1 ) 2 if γ = 1/L, g(θ t 1 ) µ L[ g(θt 1 ) g(θ ) ] using strongly-convex distance to optimum Thus, g(θ t ) g(θ ) (1 µ/l) t[ g(θ 0 ) g(θ ) ] May also get (Nesterov, 2004): θ t θ 2 ( 1 2γµL ) t θ0 µ+l θ 2 as soon as γ 2 µ+l 76
77 Convergence proof - gradient descent smooth convex functions - II Iteration: θ t = θ t 1 γg (θ t 1 ) with γ = 1/L θ t θ 2 = θ t 1 θ γg (θ t 1 ) 2 = θ t 1 θ 2 +γ 2 g (θ t 1 ) 2 2γ(θ t 1 θ ) g (θ t 1 ) θ t 1 θ 2 +γ 2 g (θ t 1 ) 2 2 γ L g (θ t 1 ) 2 using co-coercivity = θ t 1 θ 2 γ(2/l γ) g (θ t 1 ) 2 θ t 1 θ 2 if γ 2/L θ 0 θ 2 : bounded iterates g(θ t ) g(θ t 1 ) 1 2L g (θ t 1 ) 2 (see previous slide) g(θ t 1 ) g(θ ) g (θ t 1 ) (θ t 1 θ ) g (θ t 1 ) θ t 1 θ (Cauchy-Schwarz) 1 [ g(θ t ) g(θ ) g(θ t 1 ) g(θ ) g(θt 1 2L θ 0 θ 2 ) g(θ ) ] 2 77
78 Convergence proof - gradient descent smooth convex functions - II Iteration: θ t = θ t 1 γg (θ t 1 ) with γ = 1/L g(θ t ) g(θ ) g(θ t 1 ) g(θ ) 1 2L θ 0 θ 2 [ g(θt 1 ) g(θ ) ] 2 of the form k k 1 α 2 k 1 with 0 k = g(θ k ) g(θ ) L 2 θ k θ 2 1 k 1 1 k 1 1 k α k 1 k by dividing by k k 1 1 k α because ( k ) is non-increasing 1 1 αt by summing from k = 1 to t 0 t 0 t by inverting 1+αt 0 2L θ 0 θ 2 t+4 since 0 L 2 θ k θ 2 and α = 1 2L θ 0 θ 2 78
79 Limits on convergence rate of first-order methods First-order method: any iterative algorithm that selects θ t in θ 0 +span(f (θ 0 ),...,f (θ t 1 )) Problem class: minimizer θ convex L-smooth functions with a global Theorem: for every integer t (d 1)/2 and every θ 0, there exist functions in the problem class such that for any first-order method, g(θ t ) g(θ ) 3 L θ 0 θ 2 32 (t+1) 2 O(1/t) rate for gradient method may not be optimal! 79
80 Limits on convergence rate of first-order methods Proof sketch Define quadratic function g t (θ) = L 8[ (θ 1 ) 2 + t 1 (θ i θ i+1 ) 2 +(θ t ) 2 2θ 1] i=1 Fact 1: g t is L-smooth Fact 2: minimizer supported by first t coordinates (closed form) Fact 3: any first-order method starting from zero will be supported in the first k coordinates after iteration k Fact 4: the minimum over this support in {1,...,k} may be computed in closed form Given iteration k, take g = g 2k+1 and compute lower-bound on g(θ k ) g(θ ) θ 0 θ 2 80
81 Accelerated gradient methods (Nesterov, 1983) Assumptions g convex with L-Lipschitz-cont. gradient, min. attained at θ Algorithm: θ t = η t 1 1 L g (η t 1 ) Bound: η t = θ t + t 1 t+2 (θ t θ t 1 ) g(θ t ) g(θ ) 2L θ 0 θ 2 (t+1) 2 Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011) Not improvable Extension to strongly-convex functions 81
82 Accelerated gradient methods - strong convexity Assumptions g convex with L-Lipschitz-cont. gradient, min. attained at θ g µ-strongly convex Algorithm: θ t = η t 1 1 L g (η t 1 ) η t = θ t + 1 µ/l 1+ µ/l (θ t θ t 1 ) Bound: g(θ t ) f(θ ) L θ 0 θ 2 (1 µ/l) t Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011) Not improvable Relationship with conjugate gradient for quadratic functions 82
83 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2012b) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t) 83
84 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2012b) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t) Problems of the form: min θ R df(θ)+µω(θ) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+µω(θ)+ L 2 θ θ t 2 2 Ω(θ) = θ 1 Thresholded gradient descent Similar convergence rates than smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009) 84
85 Soft-thresholding for the l 1 -norm 1 Example 1: quadratic problem in 1D, i.e. min x R 2 x2 xy +λ x Piecewise quadratic function with a kink at zero Derivative at 0+: g + = λ y and 0 : g = λ y x = 0 is the solution iff g + 0 and g 0 (i.e., y λ) x 0 is the solution iff g + 0 (i.e., y λ) x = y λ x 0 is the solution iff g 0 (i.e., y λ) x = y +λ Solution x = sign(y)( y λ) + = soft thresholding 85
86 Soft-thresholding for the l 1 -norm 1 Example 1: quadratic problem in 1D, i.e. min x R 2 x2 xy +λ x Piecewise quadratic function with a kink at zero Solution x = sign(y)( y λ) + = soft thresholding x x*(y) λ λ y 86
87 Projected gradient descent Problems of the form: min θ K f(θ) θ t+1 = argmin f(θ t)+(θ θ t ) f(θ t )+ L θ K 2 θ θ t θ t+1 = argmin θ ( θ t 1 θ K 2 L f(θ t) ) 2 2 Projected gradient descent Similar convergence rates than smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009) 87
88 Newton method Given θ t 1, minimize second-order Taylor expansion g(θ) = g(θ t 1 )+g (θ t 1 ) (θ θ t 1 )+ 1 2 (θ θ t 1) g (θ t 1 ) (θ θ t 1 ) Expensive Iteration: θ t = θ t 1 g (θ t 1 ) 1 g (θ t 1 ) Running-time complexity: O(d 3 ) in general Quadratic convergence: If θ t 1 θ small enough, for some constant C, we have (C θ t θ ) = (C θ t 1 θ ) 2 See Boyd and Vandenberghe (2003) 88
89 Summary: minimizing smooth convex functions Assumption: g convex Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/t) convergence rate for smooth convex functions O(e tµ/l ) convergence rate for strongly smooth convex functions Optimal rates O(1/t 2 ) and O(e t µ/l ) Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate 89
90 Summary: minimizing smooth convex functions Assumption: g convex Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/t) convergence rate for smooth convex functions O(e tµ/l ) convergence rate for strongly smooth convex functions Optimal rates O(1/t 2 ) and O(e t µ/l ) Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate From smooth to non-smooth Subgradient method and ellipsoid 90
91 Counter-example (Bertsekas, 1999) Steepest descent for nonsmooth objectives g(θ 1,θ 2 ) = { 5(9θ θ 2 2) 1/2 if θ 1 > θ 2 (9θ θ 2 ) 1/2 if θ 1 θ 2 Steepest descent starting from any θ such that θ 1 > θ 2 > (9/16) 2 θ
92 Subgradient method/ descent (Shor et al., 1985) Assumptions g convex and B-Lipschitz-continuous on { θ 2 D} ( Algorithm: θ t = Π D θ t 1 2D ) B t g (θ t 1 ) Π D : orthogonal projection onto { θ 2 D} Constraints 92
93 Subgradient method/ descent (Shor et al., 1985) Assumptions g convex and B-Lipschitz-continuous on { θ 2 D} ( Algorithm: θ t = Π D θ t 1 2D ) B t g (θ t 1 ) Π D : orthogonal projection onto { θ 2 D} Bound: Three-line proof ( 1 g t t 1 k=0 θ k ) g(θ ) 2DB t Best possible convergence rate after O(d) iterations (Bubeck, 2015) 93
94 Subgradient method/ descent - proof - I Iteration: θ t = Π D (θ t 1 γ t g (θ t 1 )) with γ t = 2D B t Assumption: g (θ) 2 B and θ 2 D θ t θ 2 2 θ t 1 θ γ t g (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γ 2 t 2γ t (θ t 1 θ ) g (θ t 1 ) because g (θ t 1 ) 2 B θ t 1 θ 2 2+B 2 γ 2 t 2γ t [ g(θt 1 ) g(θ ) ] (property of subgradients) leading to g(θ t 1 ) g(θ ) B2 γ t [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 94
95 Subgradient method/ descent - proof - II Starting from g(θ t 1 ) g(θ ) B2 γ t 2 Constant step-size γ t = γ t [ g(θu 1 ) g(θ ) ] u=1 t u=1 B 2 γ 2 + t u=1 + 1 [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 1 [ θu 1 θ 2 2γ 2 θ u θ 2 2] t B2 γ γ θ 0 θ 2 2 t B2 γ γ D2 Optimized step-size γ t = 2D B t depends on horizon Leads to bound of 2DB t ( 1 t 1 ) Using convexity: g θ k g(θ ) 1 t t k=0 t 1 k=0 g(θ k ) g(θ ) 2DB t 95
96 Subgradient method/ descent - proof - III Starting from Decreasing step-size t [ g(θu 1 ) g(θ ) ] u=1 g(θ t 1 ) g(θ ) B2 γ t 2 = = t u=1 t u=1 B 2 γ u 2 t u=1 t u=1 B 2 γ u 2 B 2 γ u 2 B 2 γ u 2 + u=1 t u=1 + 1 [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 1 [ θu 1 θ 2 2γ 2 θ u θ 2 ] 2 u t 1 ( + θ u θ ) θ 0 θ θ t θ 2 2 2γ u+1 2γ u 2γ 1 2γ t + t 1 u=1 4D 2( 1 2γ u+1 1 2γ u ) + 4D 2 2γ 1 + 4D2 2γ t 3DB t with γ t = 2D B t Using convexity: g ( 1 t t 1 k=0 θ ) k g(θ ) 3DB t 96
97 Subgradient descent for machine learning Assumptions (f is the expected risk, ˆf the empirical risk) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. ˆf(θ) = 1 n n i=1 l(y i,φ(x i ) θ) G-Lipschitz loss: f and ˆf are GR-Lipschitz on Θ = { θ 2 D} Statistics: with probability greater than 1 δ sup ˆf(θ) f(θ) GRD [2+ 2log 2 θ Θ n δ ] Optimization: after t iterations of subgradient method ˆf(ˆθ) min η Θ ˆf(η) GRD t t = n iterations, with total running-time complexity of O(n 2 d) 97
98 Subgradient descent - strong convexity Assumptions g convex and B-Lipschitz-continuous on { θ 2 D} g µ-strongly convex ) 2 Algorithm: θ t = Π D (θ t 1 µ(t+1) g (θ t 1 ) Bound: ( 2 g t(t+1) t k=1 kθ k 1 ) g(θ ) 2B2 µ(t+1) Three-line proof Best possible convergence rate after O(d) iterations (Bubeck, 2015) 98
99 Subgradient method - strong convexity - proof - I Iteration: θ t = Π D (θ t 1 γ t g (θ t 1 )) with γ t = 2 µ(t+1) Assumption: g (θ) 2 B and θ 2 D and µ-strong convexity of f θ t θ 2 2 θ t 1 θ γ t g (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γt 2 2γ t (θ t 1 θ ) g (θ t 1 ) because g (θ t 1 ) 2 θ t 1 θ 2 2+B 2 γt 2 [ 2γ t g(θt 1 ) g(θ )+ µ 2 θ t 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to g(θ t 1 ) g(θ ) B2 γ t B 2 µ(t+1) + µ 2 [1 γ t µ ] θ t 1 θ γ t θ t θ 2 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ
100 Subgradient method - strong convexity - proof - II From g(θ t 1 ) g(θ ) B2 µ(t+1) + µ 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ t u [ g(θ u 1 ) g(θ ) ] u=1 u t=1 B 2 u µ(u+1) t [ u(u 1) θu 1 θ 2 2 u(u+1) θ u θ 2 2] u=1 B2 t µ + 1 [ 0 t(t+1) θt θ 2 B 4 2] 2 t µ ( 2 Using convexity: g t(t+1) t u=1 uθ u 1 ) g(θ ) 2B2 t+1 NB: with step-size γ n = 1/(nµ), extra logarithmic factor 100
101 Ellipsoid method Minimizing convex function g : R d R Builds a sequence of ellipsoids that contains the global minima. E 0 E 1 E 2 E 1 Represent E t = {θ R d,(θ θ t ) P 1 t (θ θ t ) 1} Fact 1: θ t+1 = θ t 1 d+1 P th t and P t+1 = d2 d 2 1 (P t 2 d+1 P th t h t P t ) 1 with h t = g (θ t ) P t g (x t ) g (θ t ) Fact 2: vol(e t ) vol(e t 1 )e 1/2d CV rate in O(e t/d2 ) 101
102 Summary: minimizing convex functions Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 g (θ t 1 ) 1 g (θ t 1 ) O ( e ρ2t ) convergence rate - Key insights from Bottou and Bousquet (2008) - In machine learning, no need to optimize below statistical error - In machine learning, cost functions are averages - Testing errors are more important than training errors Stochastic approximation 102
103 Summary: minimizing convex functions Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 g (θ t 1 ) 1 g (θ t 1 ) O ( e ρ2t ) convergence rate Key insights from Bottou and Bousquet (2008) 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages 3. Testing errors are more important than training errors Stochastic approximation 103
104 Problem parameters Summary of rates of convergence D diameter of the domain B Lipschitz-constant L smoothness constant µ strong convexity constant convex strongly convex nonsmooth deterministic: BD/ t deterministic: B 2 /(tµ) stochastic: BD/ n stochastic: B 2 /(nµ) smooth deterministic: LD 2 /t 2 deterministic: exp( t µ/l) stochastic: LD 2 / n stochastic: L/(nµ) finite sum: n/t finite sum: exp( min{1/n, µ/l} quadratic deterministic: LD 2 /t 2 deterministic: exp( t µ/l) stochastic: d/n+ld 2 /n stochastic: d/n+ld 2 /n 104
105 1. Introduction Outline - II Large-scale machine learning and optimization Classes of functions (convex, smooth, etc.) Traditional statistical analysis through Rademacher complexity 2. Classical methods for convex optimization Smooth optimization (gradient descent, Newton method) Non-smooth optimization (subgradient descent) Proximal methods 3. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 105
106 Outline - II 4. Classical stochastic approximation Asymptotic analysis Robbins-Monro algorithm Polyak-Rupert averaging 5. Smooth stochastic approximation algorithms Non-asymptotic analysis for smooth functions Logistic regression Least-squares regression without decaying step-sizes 6. Finite data sets Gradient methods with exponential convergence rates Convex duality (Dual) stochastic coordinate descent - Frank-Wolfe 106
107 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d 107
108 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d Machine learning - statistics loss for a single pair of observations: f n (θ) = l(y n,θ Φ(x n )) f(θ) = Ef n (θ) = El(y n,θ Φ(x n )) = generalization error Expected gradient: f (θ) = Ef n(θ) = E { l (y n,θ Φ(x n ))Φ(x n ) } Non-asymptotic results Number of iterations = number of observations 108
109 Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d Stochastic approximation (much) broader applicability beyond convex optimization θ n = θ n 1 γ n h n (θ n 1 ) with E [ h n (θ n 1 ) θ n 1 ] = h(θn 1 ) Beyond convex problems, i.i.d assumption, finite dimension, etc. Typically asymptotic results (see next lecture) See, e.g., Kushner and Yin (2003); Benveniste et al. (2012) 109
110 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations 110
111 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results 111
112 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results Online learning Update ˆθ n after each new (potentially adversarial) observation z n Cumulative loss: 1 n n k=1 l(ˆθ k 1,z k ) Online to batch through averaging (Cesa-Bianchi et al., 2004) 112
113 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex 113
114 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α 114
115 Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α Desirable practical behavior Applicable (at least) to classical supervised learning problems Robustness to (potentially unknown) constants (L,B,µ) Adaptivity to difficulty of the problem (e.g., strong convexity) 115
116 Stochastic subgradient descent /method Assumptions f n convex and B-Lipschitz-continuous on { θ 2 D} (f n ) i.i.d. functions such that Ef n = f θ global optimum of f on C = { θ 2 D} ( Algorithm: θ n = Π D θ n 1 2D ) B n f n(θ n 1 ) 116
117 Stochastic subgradient descent /method Assumptions f n convex and B-Lipschitz-continuous on { θ 2 D} (f n ) i.i.d. functions such that Ef n = f θ global optimum of f on C = { θ 2 D} ( Algorithm: θ n = Π D θ n 1 2D ) B n f n(θ n 1 ) Bound: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Same three-line proof as in the deterministic case Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Running-time complexity: O(dn) after n iterations 117
118 Stochastic subgradient method - proof - I Iteration: θ n = Π D (θ n 1 γ n f n(θ n 1 )) with γ n = 2D B n F n : information up to time n f n(θ) 2 B and θ 2 D, unbiased gradients/functions E(f n F n 1 ) = f θ n θ 2 2 θ n 1 θ γ n f n(θ n 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2 +B 2 γ 2 n 2γ n (θ n 1 θ ) f n(θ n 1 ) because f n(θ n 1 ) 2 E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ ) f (θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] (subgradient proper E θ n θ 2 2 E θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ Ef(θn 1 ) f(θ ) ] leading to Ef(θ n 1 ) f(θ ) B2 γ n [ ] E θn 1 θ 2 2γ 2 E θ n θ 2 2 n 118
119 Stochastic subgradient method - proof - II Starting from Ef(θ n 1 ) f(θ ) B2 γ n [ ] E θn 1 θ 2 2 E θ n θ 2 2 2γ n n [ Ef(θu 1 ) f(θ ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 1 [ E θu 1 θ 2 2γ 2 E θ u θ 2 ] 2 u + 4D2 2γ n 2DB n with γ n = 2D B n Using convexity: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n 119
120 Stochastic subgradient method Extension to online learning Assume different and arbitrary functions f n : R d R Observations of f n(θ n 1 )+ε n with E(ε n F n 1 ) = 0 and f n(θ n 1 )+ε n B almost surely Performance criterion: (normalized) regret 1 n n i=1 f i (θ i 1 ) inf θ 2 D 1 n n i=1 f i (θ) Warning: often not normalized May not be non-negative (typically is) 120
121 Stochastic subgradient method - online learning - I Iteration: θ n = Π D (θ n 1 γ n (f n(θ n 1 )+ε n )) with γ n = 2D B n F n : information up to time n - θ an arbitrary point such that θ D f n(θ n 1 )+ε n 2 B and θ 2 D, unbiased gradients E(ε n F n 1 ) = 0 θ n θ 2 2 θ n 1 θ γ n (f n(θ n 1 )+ε n ) 2 2 by contractivity of projections θ n 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ) (f n(θ n 1 )+ε n ) because f n(θ n 1 )+ E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ) f n(θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ fn (θ n 1 ) f n (θ) ] (subgradient propert E θ n θ 2 2 E θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ Efn (θ n 1 ) f n (θ) ] leading to Ef n (θ n 1 ) f n (θ) B2 γ n [ ] E θn 1 θ 2 2γ 2 E θ n θ 2 2 n 121
122 Stochastic subgradient method - online learning - II Starting from Ef n (θ n 1 ) f n (θ) B2 γ n [ E θn 1 θ 2 2γ 2 E θ n θ 2 ] 2 n n [ Efu (θ u 1 ) f u (θ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 1 [ E θu 1 θ 2 2γ 2 E θ u θ 2 ] 2 u + 4D2 2γ n 2DB n with γ n = 2D B n For any θ such that θ D: 2DB n 1 n n k=1 Ef k (θ k 1 ) 1 n n k=1 f k (θ) Online to batch conversion: assuming convexity 122
123 Stochastic subgradient descent - strong convexity - I Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f f µ-strongly convex on { θ 2 D} θ global optimum of f over { θ 2 D} ) 2 Algorithm: θ n = Π D (θ n 1 µ(n+1) f n(θ n 1 ) Bound: ( 2 Ef n(n+1) n ) kθ k 1 k=1 f(θ ) 2B2 µ(n+1) Same proof than deterministic case (Lacoste-Julien et al., 2012) Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) 123
124 Stochastic subgradient - strong convexity - proof - I Iteration: θ n = Π D (θ n 1 γ n f n(θ t 1 )) with γ n = 2 µ(n+1) Assumption: f n(θ) 2 B and θ 2 D and µ-strong convexity of f θ n θ 2 2 θ n 1 θ γ n f n(θ t 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2+B 2 γn 2γ 2 n (θ n 1 θ ) f n(θ t 1 ) because f n(θ t 1 ) 2 E( F n 1 ) θ n 1 θ 2 2+B 2 γn 2γ 2 [ n f(θn 1 ) f(θ )+ µ 2 θ n 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to Ef(θ n 1 ) f(θ ) B2 γ n B 2 µ(n+1) + µ 2 [ 1 γ n µ ] θ n 1 θ γ n θ n θ 2 2 [n 1 2 ] θn 1 θ 2 2 µ(n+1) θ n θ
125 Stochastic subgradient - strong convexity - proof - II FromEf(θ n 1 ) f(θ ) B2 µ(n+1) +µ 2 [n 1 2 ] E θn 1 θ 2 2 µ(n+1) E θ n θ n u [ Ef(θ u 1 ) f(θ ) ] u=1 n u=1 B 2 u µ(u+1) n [ u(u 1)E θu 1 θ 2 2 u(u+1)e θ u u=1 B2 n µ + 1 [ 0 n(n+1)e θn θ 2 B 4 2] 2 n µ ( 2 Using convexity: Ef n(n+1) n u=1 uθ u 1 ) g(θ ) 2B2 n+1 NB: with step-size γ n = 1/(nµ), extra logarithmic factor (see later) 125
126 Stochastic subgradient descent - strong convexity - II Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f θ global optimum of g = f + µ No compactness assumption - no projections Algorithm: 2 2 [ ] θ n = θ n 1 µ(n+1) g n(θ n 1 ) = θ n 1 f µ(n+1) n (θ n 1 )+µθ n 1 ( 2 Bound: Eg n(n+1) n k=1 Minimax convergence rate kθ k 1 ) g(θ ) 2B2 µ(n+1) 126
127 Strong convexity - proof with logn factor - I Iteration: θ n = Π D (θ n 1 γ n f n(θ t 1 )) with γ n = 1 µn Assumption: f n(θ) 2 B and θ 2 D and µ-strong convexity of f θ n θ 2 2 θ n 1 θ γ n f n(θ t 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2+B 2 γn 2γ 2 n (θ n 1 θ ) f n(θ t 1 ) because f n(θ t 1 ) 2 E( F n 1 ) θ n 1 θ 2 2+B 2 γn 2γ 2 [ n f(θn 1 ) f(θ )+ µ 2 θ n 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to Ef(θ n 1 ) f(θ ) B2 γ n [ 1 γ n µ ] θ n 1 θ γ n θ n θ 2 2 B2 2µn + µ [ ] n 1 θn 1 θ nµ 2 θ n θ
128 Strong convexity - proof with logn factor - II From Ef(θ n 1 ) f(θ ) B2 2µn + µ [ ] n 1 θn 1 θ 2 2 nµ 2 2 θ n θ 2 2 n [ Ef(θu 1 ) f(θ ) ] u=1 n u=1 B 2 2µu B2 logn 2µ n [ (u 1)E θu 1 θ 2 2 ue θ u θ 2 2] u= [ 0 ne θn θ 2 2 ] B 2 logn 2µ Using convexity: Ef ( 1 n n u=1 Why could this be useful? θ u 1 ) f(θ ) B2 logn 2µ n 128
129 Stochastic subgradient descent - strong convexity Online learning Need logn term for uniform averaging. For all θ: 1 n n i=1 f i (θ i 1 ) 1 n n i=1 f i (θ) B2 logn 2µ n Optimal. See Hazan and Kale (2014). 129
130 Typical result: Ef Beyond convergence in expectation ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Obtained with simple conditioning arguments High-probability bounds ( Markov inequality: P f ( 1 n ) n 1 k=0 θ k ) f(θ ) ε 2DB nε 130
131 Typical result: Ef Beyond convergence in expectation ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Obtained with simple conditioning arguments High-probability bounds ( Markov inequality: P f ( 1 n ) n 1 k=0 θ k ) f(θ ) ε 2DB nε Deviation inequality (Nemirovski et al., 2009; Nesterov and Vial, 2008) ( ( n 1 1 P f n k=0 θ k ) f(θ ) 2DB ) (2+4t) n 2exp( t 2 ) See also Bach (2013) for logistic regression 131
132 Stochastic subgradient method - high probability - I Iteration: θ n = Π D (θ n 1 γ n f n(θ n 1 )) with γ n = 2D B n F n : information up to time n f n(θ) 2 B and θ 2 D, unbiased gradients/functions E(f n F n 1 ) = f θ n θ 2 2 θ n 1 θ γ n f n(θ n 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2 +B 2 γ 2 n 2γ n (θ n 1 θ ) f n(θ n 1 ) because f n(θ n 1 ) 2 E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ ) f (θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] (subgradient proper Without expectations and with Z n = 2γ n (θ n 1 θ ) [f n(θ n 1 ) f (θ n 1 )] θ n θ 2 2 θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] +Z n 132
133 Stochastic subgradient method - high probability - II Without expectations and with Z n = 2γ n (θ n 1 θ ) [f n(θ n 1 ) f (θ n 1 )] θ n θ 2 2 θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] +Z n f(θ n 1 ) f(θ ) 1 [ θn 1 θ 2 2γ 2 θ n θ 2 ] B 2 γ n 2 + n 2 + Z n 2γ n n [ f(θu 1 ) f(θ ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 + 4D2 2γ n + 1 [ θu 1 θ 2 2γ 2 θ u θ 2 ] n 2 + u u=1 n n u=1 Z u 2DB + 2γ u n u=1 Z u 2γ u Z u with γ n = 2D 2γ u B n Need to study n u=1 Z u 2γ u with E(Z n F n 1 ) = 0 and Z n 8γ n DB 133
134 Stochastic subgradient method - high probability - III Need to study n u=1 Z u 2γ u with E( Z n 2γ n F n 1 ) = 0 and Z n 4DB Azuma-Hoeffding inequality for bounded martingale increments: ( n P u=1 Z u t ) n 4DB 2γ u exp ( t2 2 ) Moments with Burkholder-Rosenthal-Pinelis inequality(pinelis, 1994) 134
135 Beyond stochastic gradient method Adding a proximal step Goal: min θ R df(θ)+ω(θ) = Ef n(θ)+ω(θ) Replace recursion θ n = θ n 1 γ n f n(θ n ) by θ n = min θ R d θ θ n 1 +γ n f n(θ n ) 2 2 +CΩ(θ) Xiao (2010); Hu et al. (2009) May be accelerated (Ghadimi and Lan, 2013) Related frameworks Regularized dual averaging (Nesterov, 2009; Xiao, 2010) Mirror descent (Nemirovski et al., 2009; Lan et al., 2012) 135
Large-scale machine learning and convex optimization
Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE IFCAM, Bangalore - July 2014 Big data revolution? A new scientific
More informationLarge-scale machine learning and convex optimization
Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Allerton Conference - September 2015 Slides available at www.di.ens.fr/~fbach/gradsto_allerton.pdf
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July
More informationLecture 2 February 25th
Statistical machine learning and convex optimization 06 Lecture February 5th Lecturer: Francis Bach Scribe: Guillaume Maillard, Nicolas Brosse This lecture deals with classical methods for convex optimization.
More informationStochastic gradient descent and robustness to ill-conditioning
Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More informationStochastic optimization: Beyond stochastic gradients and convexity Part I
Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationMinimizing Finite Sums with the Stochastic Average Gradient Algorithm
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale
More informationLinearly-Convergent Stochastic-Gradient Methods
Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure
More informationProximal Minimization by Incremental Surrogate Optimization (MISO)
Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine
More informationStatistical Optimality of Stochastic Gradient Descent through Multiple Passes
Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Loucas Pillaud-Vivien
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationPerturbed Proximal Gradient Algorithm
Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing
More informationStochastic Optimization
Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationAdaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationIncremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning
Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationStochastic Optimization Part I: Convex analysis and online stochastic optimization
Stochastic Optimization Part I: Convex analysis and online stochastic optimization Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More information1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method
L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationStochastic Optimization: First order method
Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationMaking Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationStatistics 300B Winter 2018 Final Exam Due 24 Hours after receiving it
Statistics 300B Winter 08 Final Exam Due 4 Hours after receiving it Directions: This test is open book and open internet, but must be done without consulting other students. Any consultation of other students
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationLecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning
Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationThe FTRL Algorithm with Strongly Convex Regularizers
CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationOptimized first-order minimization methods
Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure
More informationConvex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization
Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f
More informationA Sparsity Preserving Stochastic Gradient Method for Composite Optimization
A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationConvergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity
Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationAdaptive Online Gradient Descent
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow
More informationNon-Smooth, Non-Finite, and Non-Convex Optimization
Non-Smooth, Non-Finite, and Non-Convex Optimization Deep Learning Summer School Mark Schmidt University of British Columbia August 2015 Complex-Step Derivative Using complex number to compute directional
More informationBits of Machine Learning Part 1: Supervised Learning
Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationStochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,
More informationGradient Descent. Sargur Srihari
Gradient Descent Sargur srihari@cedar.buffalo.edu 1 Topics Simple Gradient Descent/Ascent Difficulties with Simple Gradient Descent Line Search Brent s Method Conjugate Gradient Descent Weight vectors
More informationOnline Learning and Online Convex Optimization
Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationStochastic Gradient Descent with Only One Projection
Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine
More informationOptimal Regularized Dual Averaging Methods for Stochastic Optimization
Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business
More informationStochastic Composition Optimization
Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationStatistical Machine Learning II Spring 2017, Learning Theory, Lecture 4
Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.
More informationORIE 4741: Learning with Big Messy Data. Proximal Gradient Method
ORIE 4741: Learning with Big Messy Data Proximal Gradient Method Professor Udell Operations Research and Information Engineering Cornell November 13, 2017 1 / 31 Announcements Be a TA for CS/ORIE 1380:
More informationTutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer
Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,
More informationRelative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent
Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order
More informationONLINE VARIANCE-REDUCING OPTIMIZATION
ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com
More informationGeneralized Conditional Gradient and Its Applications
Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized
More informationA Parallel SGD method with Strong Convergence
A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More information