Large-scale machine learning and convex optimization
1 Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Allerton Conference - September 2015 Slides available at
2 Context Machine learning for big data Large-scale machine learning: large d, large n, large k d : dimension of each observation (input) n : number of observations k : number of tasks (dimension of outputs) Examples: computer vision, bioinformatics, text processing Ideal running-time complexity: O(dn + kn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent
3 Visual object recognition
4 Personal photos
5 Learning for bioinformatics - Proteins Crucial components of cell life Predicting multiple functions and interactions Massive data: up to 1 million for humans! Complex data: amino-acid sequence, link with DNA, three-dimensional molecule
6 Search engines - advertising
9 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
10 Supervised machine learning Data: n observations (x_i, y_i) ∈ X × Y, i = 1,...,n, i.i.d. Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^d (Regularized) empirical risk minimization: find θ̂ solution of min_{θ ∈ R^d} (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) + µΩ(θ) = convex data-fitting term + regularizer
11 Usual losses Regression: y ∈ R, prediction ŷ = θ⊤Φ(x), quadratic loss (1/2)(y − ŷ)² = (1/2)(y − θ⊤Φ(x))²
12 Usual losses Regression: y ∈ R, prediction ŷ = θ⊤Φ(x), quadratic loss (1/2)(y − ŷ)² = (1/2)(y − θ⊤Φ(x))² Classification: y ∈ {−1, 1}, prediction ŷ = sign(θ⊤Φ(x)), loss of the form l(y θ⊤Φ(x)) True 0-1 loss: l(y θ⊤Φ(x)) = 1_{y θ⊤Φ(x) < 0} Usual convex losses: hinge, square, logistic
13 Main motivating examples Support vector machine (hinge loss): l(Y, θ⊤Φ(X)) = max{1 − Y θ⊤Φ(X), 0} Logistic regression: l(Y, θ⊤Φ(X)) = log(1 + exp(−Y θ⊤Φ(X))) Least-squares regression: l(Y, θ⊤Φ(X)) = (1/2)(Y − θ⊤Φ(X))²
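As a concrete illustration of these three losses, here is a small NumPy sketch (not from the slides; the function names and the toy data are my own) evaluating the hinge, logistic, and quadratic losses of a linear predictor θ⊤Φ(x):

```python
import numpy as np

def hinge_loss(y, pred):
    """SVM loss: max{1 - y * theta'Phi(x), 0}, with y in {-1, +1}."""
    return np.maximum(1.0 - y * pred, 0.0)

def logistic_loss(y, pred):
    """Logistic loss: log(1 + exp(-y * theta'Phi(x))), with y in {-1, +1}."""
    return np.logaddexp(0.0, -y * pred)  # numerically stable log(1 + e^{-u})

def squared_loss(y, pred):
    """Least-squares loss: (1/2) (y - theta'Phi(x))^2, with y real-valued."""
    return 0.5 * (y - pred) ** 2

# Example: one observation with features Phi(x) and parameter theta.
theta = np.array([0.5, -1.0, 0.2])
phi_x = np.array([1.0, 0.3, -0.7])
pred = theta @ phi_x
print(hinge_loss(+1, pred), logistic_loss(+1, pred), squared_loss(0.4, pred))
```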
14 Usual regularizers Main goal: avoid overfitting (Squared) Euclidean norm: ‖θ‖₂² = Σ_{j=1}^d θ_j² Numerically well-behaved Representer theorem and kernel methods: θ = Σ_{i=1}^n α_i Φ(x_i) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) Sparsity-inducing norms Main example: l₁-norm ‖θ‖₁ = Σ_{j=1}^d |θ_j| Perform model selection as well as regularization Non-smooth optimization and structured sparsity See, e.g., Bach, Jenatton, Mairal, and Obozinski (2012b,a)
16 Supervised machine learning Data: n observations (x_i, y_i) ∈ X × Y, i = 1,...,n, i.i.d. Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^d (Regularized) empirical risk minimization: find θ̂ solution of min_{θ ∈ R^d} (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) + µΩ(θ) = convex data-fitting term + regularizer Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) = training cost Expected risk: f(θ) = E_{(x,y)} l(y, θ⊤Φ(x)) = testing cost Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂ - may be tackled simultaneously
18 Supervised machine learning Data: n observations (x_i, y_i) ∈ X × Y, i = 1,...,n, i.i.d. Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^d (Regularized) empirical risk minimization: find θ̂ solution of min_{θ ∈ R^d} (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) such that Ω(θ) ≤ D = convex data-fitting term + constraint Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) = training cost Expected risk: f(θ) = E_{(x,y)} l(y, θ⊤Φ(x)) = testing cost Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂ - may be tackled simultaneously
19 General assumptions Data: n observations (x_i, y_i) ∈ X × Y, i = 1,...,n, i.i.d. Bounded features Φ(x) ∈ R^d: ‖Φ(x)‖₂ ≤ R Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) = training cost Expected risk: f(θ) = E_{(x,y)} l(y, θ⊤Φ(x)) = testing cost Loss for a single observation: f_i(θ) = l(y_i, θ⊤Φ(x_i)), so that for all i, f(θ) = E f_i(θ) Properties of f_i, f, f̂: convex on R^d; additional regularity assumptions: Lipschitz-continuity, smoothness and strong convexity
20 Lipschitz continuity Bounded gradients of f (Lipschitz-continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: for all θ ∈ R^d, ‖θ‖₂ ≤ D ⇒ ‖f′(θ)‖₂ ≤ B Machine learning: with f(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)), a G-Lipschitz loss and R-bounded data give B = GR
21 Smoothness and strong convexity A function f : R^d → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous: for all θ₁, θ₂ ∈ R^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂ If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≼ L·Id (Figure: a smooth and a non-smooth function)
22 Smoothness and strong convexity A function f : R^d → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous: for all θ₁, θ₂ ∈ R^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂ If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≼ L·Id Machine learning: with f(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)), Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ l-smooth loss and R-bounded data: L = l·R²
23 Smoothness and strong convexity A function f : R^d → R is µ-strongly convex if and only if for all θ₁, θ₂ ∈ R^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2)‖θ₁ − θ₂‖₂² If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≽ µ·Id (Figure: a convex and a strongly convex function)
24 Smoothness and strong convexity A function f : R^d → R is µ-strongly convex if and only if for all θ₁, θ₂ ∈ R^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2)‖θ₁ − θ₂‖₂² If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≽ µ·Id Machine learning: with f(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)), Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ Data with invertible covariance matrix (low correlation/dimension)
25 Smoothness and strong convexity A function f : R^d → R is µ-strongly convex if and only if for all θ₁, θ₂ ∈ R^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2)‖θ₁ − θ₂‖₂² If f is twice differentiable: for all θ ∈ R^d, f″(θ) ≽ µ·Id Machine learning: with f(θ) = (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)), Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ Data with invertible covariance matrix (low correlation/dimension) Adding regularization by (µ/2)‖θ‖₂² creates additional bias unless µ is small
26 Summary of smoothness/convexity assumptions Bounded gradients of f (Lipschitz-continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: for all θ ∈ R^d, ‖θ‖₂ ≤ D ⇒ ‖f′(θ)‖₂ ≤ B Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient f′: for all θ₁, θ₂ ∈ R^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂ Strong convexity of f: the function f is strongly convex with respect to the norm ‖·‖₂, with convexity constant µ > 0: for all θ₁, θ₂ ∈ R^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2)‖θ₁ − θ₂‖₂²
27 Analysis of empirical risk minimization Approximation and estimation errors: with Θ = {θ ∈ R^d, Ω(θ) ≤ D},
f(θ̂) − min_{θ ∈ R^d} f(θ) = [f(θ̂) − min_{θ ∈ Θ} f(θ)] + [min_{θ ∈ Θ} f(θ) − min_{θ ∈ R^d} f(θ)] = estimation error + approximation error
NB: may replace min_{θ ∈ R^d} f(θ) by the best (non-linear) predictions
28 Analysis of empirical risk minimization Approximation and estimation errors: with Θ = {θ ∈ R^d, Ω(θ) ≤ D},
f(θ̂) − min_{θ ∈ R^d} f(θ) = [f(θ̂) − min_{θ ∈ Θ} f(θ)] + [min_{θ ∈ Θ} f(θ) − min_{θ ∈ R^d} f(θ)] = estimation error + approximation error
1. Uniform deviation bounds, with θ̂ ∈ argmin_{θ ∈ Θ} f̂(θ) and θ*_Θ ∈ argmin_{θ ∈ Θ} f(θ):
f(θ̂) − min_{θ ∈ Θ} f(θ) = [f(θ̂) − f̂(θ̂)] + [f̂(θ̂) − f̂(θ*_Θ)] + [f̂(θ*_Θ) − f(θ*_Θ)] ≤ sup_{θ ∈ Θ} |f̂(θ) − f(θ)| + 0 + sup_{θ ∈ Θ} |f̂(θ) − f(θ)|
29 Analysis of empirical risk minimization Approximation and estimation errors: with Θ = {θ ∈ R^d, Ω(θ) ≤ D},
f(θ̂) − min_{θ ∈ R^d} f(θ) = [f(θ̂) − min_{θ ∈ Θ} f(θ)] + [min_{θ ∈ Θ} f(θ) − min_{θ ∈ R^d} f(θ)] = estimation error + approximation error
1. Uniform deviation bounds, with θ̂ ∈ argmin_{θ ∈ Θ} f̂(θ): f(θ̂) − min_{θ ∈ Θ} f(θ) ≤ 2 sup_{θ ∈ Θ} |f̂(θ) − f(θ)| - typically slow rate O(1/√n)
2. More refined concentration results with faster rates O(1/n)
30 Motivation from least-squares For least-squares, we have l(y, θ⊤Φ(x)) = (1/2)(y − θ⊤Φ(x))², and
f̂(θ) − f(θ) = (1/2) θ⊤((1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ − E Φ(X)Φ(X)⊤)θ − θ⊤((1/n) Σ_{i=1}^n y_i Φ(x_i) − E Y Φ(X)) + (1/2)((1/n) Σ_{i=1}^n y_i² − E Y²),
so that
sup_{‖θ‖₂ ≤ D} |f(θ) − f̂(θ)| ≤ (D²/2) ‖(1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ − E Φ(X)Φ(X)⊤‖_op + D ‖(1/n) Σ_{i=1}^n y_i Φ(x_i) − E Y Φ(X)‖₂ + (1/2) |(1/n) Σ_{i=1}^n y_i² − E Y²|,
hence sup_{‖θ‖₂ ≤ D} |f(θ) − f̂(θ)| ≤ O(1/√n) with high probability
31 Symmetrization with Rademacher variables Let D′ = {x′₁, y′₁, ..., x′_n, y′_n} be an independent copy of the data D = {x₁, y₁, ..., x_n, y_n}, with corresponding loss functions f′_i(θ):
E[sup_{θ ∈ Θ} f(θ) − f̂(θ)] = E[sup_{θ ∈ Θ} (f(θ) − (1/n) Σ_{i=1}^n f_i(θ))]
= E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n E(f′_i(θ) − f_i(θ) | D)]
≤ E[E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n (f′_i(θ) − f_i(θ)) | D]]
= E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n (f′_i(θ) − f_i(θ))]
= E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i (f′_i(θ) − f_i(θ))] with ε_i uniform in {−1, 1}
≤ 2 E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i f_i(θ)] = 2 × Rademacher complexity
32 Rademacher complexity Define the Rademacher complexity of the class of functions (X, Y) ↦ l(Y, θ⊤Φ(X)) as R_n = E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i f_i(θ)]. Note the two expectations, with respect to the data D and with respect to ε Main property: E[sup_{θ ∈ Θ} f(θ) − f̂(θ)] ≤ 2 R_n
33 From Rademacher complexity to bound in high probability Let Z = sup_{θ ∈ Θ} f(θ) − f̂(θ) By changing a single pair (x_i, y_i), Z may only change by (2/n) sup |l(Y, θ⊤Φ(X))| ≤ (2/n)(sup |l(Y, 0)| + GRD) = (2/n)(l₀ + GRD) = c, with sup |l(Y, 0)| = l₀ McDiarmid's inequality: with probability greater than 1 − δ, Z ≤ EZ + √((n c²/2) log(1/δ)) ≤ 2 R_n + √(2/n) (l₀ + GRD) √(log(1/δ))
34 Bounding the Rademacher average - I We have, with φ_i(u) = l(y_i, u) − l(y_i, 0), which is almost surely G-Lipschitz:
R_n = E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i f_i(θ)]
≤ E[|(1/n) Σ_{i=1}^n ε_i f_i(0)|] + E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i (f_i(θ) − f_i(0))]
≤ l₀/√n + E[sup_{θ ∈ Θ} (1/n) Σ_{i=1}^n ε_i φ_i(θ⊤Φ(x_i))]
Using Ledoux-Talagrand contraction results for Rademacher averages (since φ_i is G-Lipschitz), we get:
R_n ≤ l₀/√n + 2G E[sup_{‖θ‖₂ ≤ D} (1/n) Σ_{i=1}^n ε_i θ⊤Φ(x_i)]
35 Bounding the Rademacher average - II We have:
R_n ≤ l₀/√n + 2G E[sup_{‖θ‖₂ ≤ D} (1/n) Σ_{i=1}^n ε_i θ⊤Φ(x_i)]
= l₀/√n + 2G D E[‖(1/n) Σ_{i=1}^n ε_i Φ(x_i)‖₂]
≤ l₀/√n + 2G D √(E[‖(1/n) Σ_{i=1}^n ε_i Φ(x_i)‖₂²]) by Jensen's inequality
≤ l₀/√n + 2G D R/√n by using ‖Φ(x)‖₂ ≤ R
≤ 2(l₀ + GRD)/√n
Overall, we get, with probability 1 − δ: sup_{θ ∈ Θ} |f̂(θ) − f(θ)| ≤ (1/√n)(l₀ + GRD)(4 + √(2 log(1/δ)))
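The key quantity in this bound is E‖(1/n) Σ ε_i Φ(x_i)‖₂ ≤ R/√n. A small Monte Carlo sketch (my own construction, not from the slides) that checks this inequality numerically for bounded features:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 200, 10, 1.0

# Features with norm exactly R (random points rescaled onto the sphere of radius R).
Phi = rng.standard_normal((n, d))
Phi = R * Phi / np.linalg.norm(Phi, axis=1, keepdims=True)

# Monte Carlo estimate of E || (1/n) sum_i eps_i Phi(x_i) ||_2 over Rademacher signs.
n_samples = 2000
vals = []
for _ in range(n_samples):
    eps = rng.choice([-1.0, 1.0], size=n)
    vals.append(np.linalg.norm(Phi.T @ eps / n))
print("Monte Carlo estimate :", np.mean(vals))
print("upper bound R/sqrt(n):", R / np.sqrt(n))
```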
36 Putting it all together We have, with probability 1 − δ: For the exact minimizer θ̂ ∈ argmin_{θ ∈ Θ} f̂(θ), f(θ̂) − min_{θ ∈ Θ} f(θ) ≤ 2 sup_{θ ∈ Θ} |f̂(θ) − f(θ)| ≤ (2/√n)(l₀ + GRD)(4 + √(2 log(1/δ))) For all θ ∈ Θ, f(θ) − min_{η ∈ Θ} f(η) ≤ 2 sup_{η ∈ Θ} |f̂(η) − f(η)| + [f̂(θ) − f̂(θ̂)] Only need to optimize with precision 2(l₀ + GRD)/√n
37 Slow rate for supervised learning (summary) Assumptions (f is the expected risk, f̂ the empirical risk): Ω(θ) = ‖θ‖₂ (Euclidean norm) Linear predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖₂ ≤ R a.s. G-Lipschitz loss: f and f̂ are GR-Lipschitz on Θ = {‖θ‖₂ ≤ D} No assumptions regarding convexity With probability greater than 1 − δ: sup_{θ ∈ Θ} |f̂(θ) − f(θ)| ≤ ((l₀ + GRD)/√n)[2 + √(2 log(2/δ))] Expected estimation error: E[sup_{θ ∈ Θ} |f̂(θ) − f(θ)|] ≤ 4(l₀ + GRD)/√n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions ⇒ slow rate
38 Motivation from mean estimation Estimator: θ̂ = (1/n) Σ_{i=1}^n z_i = argmin_{θ ∈ R} (1/(2n)) Σ_{i=1}^n (θ − z_i)² = argmin_{θ ∈ R} f̂(θ) From before: f(θ) = (1/2) E(θ − z)² = (1/2)(θ − Ez)² + (1/2) var(z) = f̂(θ) + O(1/√n), and f(θ̂) = (1/2)(θ̂ − Ez)² + (1/2) var(z) = f(Ez) + O(1/√n)
39 Motivation from mean estimation Estimator: θ̂ = (1/n) Σ_{i=1}^n z_i = argmin_{θ ∈ R} (1/(2n)) Σ_{i=1}^n (θ − z_i)² = argmin_{θ ∈ R} f̂(θ) From before: f(θ) = (1/2) E(θ − z)² = (1/2)(θ − Ez)² + (1/2) var(z) = f̂(θ) + O(1/√n), and f(θ̂) = (1/2)(θ̂ − Ez)² + (1/2) var(z) = f(Ez) + O(1/√n) More refined/direct bound: f(θ̂) − f(Ez) = (1/2)(θ̂ − Ez)², so E[f(θ̂) − f(Ez)] = (1/2) E((1/n) Σ_{i=1}^n z_i − Ez)² = (1/(2n)) var(z) Bound only at θ̂ + strong convexity
40 Fast rate for supervised learning Assumptions (f is the expected risk, f̂ the empirical risk): Same as before (bounded features, Lipschitz loss) Regularized risks: f_µ(θ) = f(θ) + (µ/2)‖θ‖₂² and f̂_µ(θ) = f̂(θ) + (µ/2)‖θ‖₂² Convexity For any a > 0, with probability greater than 1 − δ, for all θ ∈ R^d:
f_µ(θ) − min_{η ∈ R^d} f_µ(η) ≤ (1 + a)(f̂_µ(θ) − min_{η ∈ R^d} f̂_µ(η)) + 8(1 + 1/a) G²R² (32 + log(1/δ)) / (µn)
Results from Sridharan, Srebro, and Shalev-Shwartz (2008); see also Boucheron and Massart (2011) and references therein Strongly convex functions ⇒ fast rate Warning: µ should decrease with n to reduce the approximation error
41 Complexity results in convex optimization for ML Assumption: f convex on R d Classical generic algorithms (sub)gradient method/descent Accelerated gradient descent Newton method Key additional properties of f Lipschitz continuity, smoothness or strong convexity Key insight from Bottou and Bousquet (2008) In machine learning, no need to optimize below estimation error Key reference: Nesterov (2004)
43 Subgradient method/descent (Shor et al., 1985) Assumptions: f convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D} Algorithm: θ_t = Π_D(θ_{t−1} − (2D/(B√t)) f′(θ_{t−1})), where Π_D is the orthogonal projection onto {‖θ‖₂ ≤ D} Bound: f((1/t) Σ_{k=0}^{t−1} θ_k) − f(θ*) ≤ 2DB/√t Three-line proof (for constant step-size) Best possible convergence rate after O(d) iterations
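A minimal sketch of this projected subgradient method on a toy non-smooth problem (an empirical hinge-loss risk over a Euclidean ball); the test problem and names are mine, not the slides':

```python
import numpy as np

def project_ball(theta, D):
    """Orthogonal projection onto the Euclidean ball {||theta||_2 <= D}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= D else theta * (D / norm)

def subgradient_method(subgrad, d, D, B, n_iter):
    """Projected subgradient method with step 2D/(B sqrt(t)); returns the averaged iterate."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(1, n_iter + 1):
        step = 2.0 * D / (B * np.sqrt(t))
        theta = project_ball(theta - step * subgrad(theta), D)
        theta_bar += (theta - theta_bar) / t       # running average of the iterates
    return theta_bar

# Toy non-smooth problem: f(theta) = (1/n) sum_i max(1 - y_i x_i'theta, 0).
rng = np.random.default_rng(0)
n, d, D = 100, 5, 1.0
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))

def subgrad(theta):
    margins = y * (X @ theta)
    active = (margins < 1).astype(float)           # examples where the hinge is active
    return -(X.T @ (active * y)) / n               # one subgradient of f at theta

B = np.max(np.linalg.norm(X, axis=1))              # upper bound on the Lipschitz constant of f
theta_bar = subgradient_method(subgrad, d, D, B, n_iter=1000)
print("objective at averaged iterate:", np.mean(np.maximum(1 - y * (X @ theta_bar), 0.0)))
```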
44 Subgradient method/descent - proof - I Iteration: θ_t = Π_D(θ_{t−1} − γ_t f′(θ_{t−1})) with γ_t = 2D/(B√t) Assumption: ‖f′(θ)‖₂ ≤ B and ‖θ*‖₂ ≤ D
‖θ_t − θ*‖₂² ≤ ‖θ_{t−1} − θ* − γ_t f′(θ_{t−1})‖₂² by contractivity of projections
≤ ‖θ_{t−1} − θ*‖₂² + B²γ_t² − 2γ_t (θ_{t−1} − θ*)⊤ f′(θ_{t−1}) because ‖f′(θ_{t−1})‖₂ ≤ B
≤ ‖θ_{t−1} − θ*‖₂² + B²γ_t² − 2γ_t [f(θ_{t−1}) − f(θ*)] (property of subgradients)
leading to f(θ_{t−1}) − f(θ*) ≤ B²γ_t/2 + (1/(2γ_t))[‖θ_{t−1} − θ*‖₂² − ‖θ_t − θ*‖₂²]
45 Subgradient method/descent - proof - II Starting from f(θ_{t−1}) − f(θ*) ≤ B²γ_t/2 + (1/(2γ_t))[‖θ_{t−1} − θ*‖₂² − ‖θ_t − θ*‖₂²], sum over the iterations:
Σ_{u=1}^t [f(θ_{u−1}) − f(θ*)] ≤ Σ_{u=1}^t B²γ_u/2 + Σ_{u=1}^t (1/(2γ_u))[‖θ_{u−1} − θ*‖₂² − ‖θ_u − θ*‖₂²]
= Σ_{u=1}^t B²γ_u/2 + Σ_{u=1}^{t−1} ‖θ_u − θ*‖₂² (1/(2γ_{u+1}) − 1/(2γ_u)) + ‖θ_0 − θ*‖₂²/(2γ_1) − ‖θ_t − θ*‖₂²/(2γ_t)
≤ Σ_{u=1}^t B²γ_u/2 + Σ_{u=1}^{t−1} 4D² (1/(2γ_{u+1}) − 1/(2γ_u)) + 4D²/(2γ_1)
= Σ_{u=1}^t B²γ_u/2 + 4D²/(2γ_t) ≤ 2DB√t with γ_t = 2D/(B√t)
Using convexity: f((1/t) Σ_{k=0}^{t−1} θ_k) − f(θ*) ≤ 2DB/√t
46 Subgradient descent for machine learning Assumptions (f is the expected risk, f̂ the empirical risk): Linear predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖₂ ≤ R a.s. f̂(θ) = (1/n) Σ_{i=1}^n l(y_i, Φ(x_i)⊤θ) G-Lipschitz loss: f and f̂ are GR-Lipschitz on Θ = {‖θ‖₂ ≤ D} Statistics: with probability greater than 1 − δ, sup_{θ ∈ Θ} |f̂(θ) − f(θ)| ≤ (GRD/√n)[2 + √(2 log(2/δ))] Optimization: after t iterations of the subgradient method, f̂(θ̂) − min_{η ∈ Θ} f̂(η) ≤ GRD/√t t = n iterations, with total running-time complexity of O(n²d)
47 Subgradient descent - strong convexity Assumptions: f convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D} f µ-strongly convex Algorithm: θ_t = Π_D(θ_{t−1} − (2/(µ(t+1))) f′(θ_{t−1})) Bound: f((2/(t(t+1))) Σ_{k=1}^t k θ_{k−1}) − f(θ*) ≤ 2B²/(µ(t+1)) Three-line proof Best possible convergence rate after O(d) iterations
48 Subgradient method - strong convexity - proof - I Iteration: θ_t = Π_D(θ_{t−1} − γ_t f′(θ_{t−1})) with γ_t = 2/(µ(t+1)) Assumption: ‖f′(θ)‖₂ ≤ B, ‖θ‖₂ ≤ D, and µ-strong convexity of f
‖θ_t − θ*‖₂² ≤ ‖θ_{t−1} − θ* − γ_t f′(θ_{t−1})‖₂² by contractivity of projections
≤ ‖θ_{t−1} − θ*‖₂² + B²γ_t² − 2γ_t (θ_{t−1} − θ*)⊤ f′(θ_{t−1}) because ‖f′(θ_{t−1})‖₂ ≤ B
≤ ‖θ_{t−1} − θ*‖₂² + B²γ_t² − 2γ_t [f(θ_{t−1}) − f(θ*) + (µ/2)‖θ_{t−1} − θ*‖₂²] (property of subgradients and strong convexity)
leading to f(θ_{t−1}) − f(θ*) ≤ B²γ_t/2 + [1/(2γ_t) − µ/2]‖θ_{t−1} − θ*‖₂² − (1/(2γ_t))‖θ_t − θ*‖₂²
= B²/(µ(t+1)) + (µ/4)[(t−1)‖θ_{t−1} − θ*‖₂² − (t+1)‖θ_t − θ*‖₂²]
49 Subgradient method - strong convexity - proof - II From f(θ_{t−1}) − f(θ*) ≤ B²/(µ(t+1)) + (µ/4)[(t−1)‖θ_{t−1} − θ*‖₂² − (t+1)‖θ_t − θ*‖₂²], multiply by t and sum:
Σ_{u=1}^t u [f(θ_{u−1}) − f(θ*)] ≤ Σ_{u=1}^t B²u/(µ(u+1)) + (µ/4) Σ_{u=1}^t [u(u−1)‖θ_{u−1} − θ*‖₂² − u(u+1)‖θ_u − θ*‖₂²]
≤ B²t/µ + (µ/4)[0 − t(t+1)‖θ_t − θ*‖₂²] ≤ B²t/µ
Using convexity: f((2/(t(t+1))) Σ_{u=1}^t u θ_{u−1}) − f(θ*) ≤ 2B²/(µ(t+1))
NB: with step-size γ_n = 1/(nµ), extra logarithmic factor
50 (smooth) gradient descent Assumptions: f convex with L-Lipschitz-continuous gradient Minimum attained at θ* Algorithm: θ_t = θ_{t−1} − (1/L) f′(θ_{t−1}) Bound: f(θ_t) − f(θ*) ≤ 2L‖θ_0 − θ*‖²/(t+4) Three-line proof Not best possible convergence rate after O(d) iterations
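A sketch of this recursion on a least-squares objective, where the smoothness constant L is the largest eigenvalue of the Hessian (the example and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def f(theta):
    """f(theta) = 1/(2n) ||y - X theta||^2."""
    return 0.5 * np.mean((y - X @ theta) ** 2)

def grad(theta):
    return X.T @ (X @ theta - y) / n

L = np.linalg.eigvalsh(X.T @ X / n).max()    # smoothness constant = largest Hessian eigenvalue

theta = np.zeros(d)
for t in range(500):
    theta = theta - grad(theta) / L          # theta_t = theta_{t-1} - (1/L) f'(theta_{t-1})

theta_opt = np.linalg.lstsq(X, y, rcond=None)[0]
print("excess cost:", f(theta) - f(theta_opt))
```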
51 (smooth) gradient descent - strong convexity Assumptions: f convex with L-Lipschitz-continuous gradient f µ-strongly convex Algorithm: θ_t = θ_{t−1} − (1/L) f′(θ_{t−1}) Bound: f(θ_t) − f(θ*) ≤ (1 − µ/L)^t [f(θ_0) − f(θ*)] Three-line proof Adaptivity of gradient descent to problem difficulty Line search
52 Accelerated gradient methods (Nesterov, 1983) Assumptions: f convex with L-Lipschitz-continuous gradient, minimum attained at θ* Algorithm: θ_t = η_{t−1} − (1/L) f′(η_{t−1}); η_t = θ_t + ((t−1)/(t+2))(θ_t − θ_{t−1}) Bound: f(θ_t) − f(θ*) ≤ 2L‖θ_0 − θ*‖²/(t+1)² Ten-line proof Not improvable Extension to strongly convex functions
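A sketch of this accelerated recursion on the same kind of least-squares toy problem (the example and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)
grad = lambda th: X.T @ (X @ th - y) / n
L = np.linalg.eigvalsh(X.T @ X / n).max()

theta = np.zeros(d)
eta = theta.copy()
for t in range(1, 501):
    theta_new = eta - grad(eta) / L                                 # theta_t = eta_{t-1} - (1/L) f'(eta_{t-1})
    eta = theta_new + (t - 1) / (t + 2) * (theta_new - theta)       # eta_t = theta_t + (t-1)/(t+2)(theta_t - theta_{t-1})
    theta = theta_new
print("final objective:", 0.5 * np.mean((y - X @ theta) ** 2))
```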
53 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2012b) Gradient descent as a proximal method (differentiable functions): θ_{t+1} = argmin_{θ ∈ R^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + (L/2)‖θ − θ_t‖₂², i.e., θ_{t+1} = θ_t − (1/L)∇f(θ_t)
54 Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2012b) Gradient descent as a proximal method (differentiable functions): θ_{t+1} = argmin_{θ ∈ R^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + (L/2)‖θ − θ_t‖₂², i.e., θ_{t+1} = θ_t − (1/L)∇f(θ_t) Problems of the form min_{θ ∈ R^d} f(θ) + µΩ(θ): θ_{t+1} = argmin_{θ ∈ R^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + µΩ(θ) + (L/2)‖θ − θ_t‖₂² Ω(θ) = ‖θ‖₁ ⇒ thresholded gradient descent Similar convergence rates to smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)
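For Ω(θ) = ‖θ‖₁ the proximal step has a closed form (componentwise soft-thresholding), which gives the thresholded gradient descent mentioned above. A minimal ISTA-style sketch on a sparse least-squares problem (setup and names are mine):

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
n, d, mu = 100, 20, 0.1
X = rng.standard_normal((n, d))
theta_true = np.zeros(d)
theta_true[:3] = [1.0, -2.0, 0.5]                 # sparse ground truth
y = X @ theta_true + 0.01 * rng.standard_normal(n)

grad = lambda th: X.T @ (X @ th - y) / n          # gradient of f(theta) = 1/(2n)||y - X theta||^2
L = np.linalg.eigvalsh(X.T @ X / n).max()

theta = np.zeros(d)
for _ in range(500):
    # Minimize the quadratic model plus mu ||theta||_1: gradient step followed by the prox.
    theta = soft_threshold(theta - grad(theta) / L, mu / L)
print("non-zero coordinates:", np.flatnonzero(theta))
```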
55 Summary: minimizing convex functions Assumption: f convex Gradient descent: θ_t = θ_{t−1} − γ_t f′(θ_{t−1}) O(1/√t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e^{−ρt}) convergence rate for smooth and strongly convex functions Newton method: θ_t = θ_{t−1} − f″(θ_{t−1})^{−1} f′(θ_{t−1}) O(e^{−ρ2^t}) convergence rate
56 Summary: minimizing convex functions Assumption: f convex Gradient descent: θ_t = θ_{t−1} − γ_t f′(θ_{t−1}) O(1/√t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e^{−ρt}) convergence rate for smooth and strongly convex functions Newton method: θ_t = θ_{t−1} − f″(θ_{t−1})^{−1} f′(θ_{t−1}) O(e^{−ρ2^t}) convergence rate Key insights from Bottou and Bousquet (2008): 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages ⇒ Stochastic approximation
57 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
58 Stochastic approximation Goal: minimizing a function f defined on R^d, given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ R^d Stochastic approximation: (much) broader applicability beyond convex optimization θ_n = θ_{n−1} − γ_n h_n(θ_{n−1}) with E[h_n(θ_{n−1}) | θ_{n−1}] = h(θ_{n−1}) Beyond convex problems, i.i.d. assumption, finite dimension, etc. Typically asymptotic results See, e.g., Kushner and Yin (2003); Benveniste et al. (2012)
59 Stochastic approximation Goal: minimizing a function f defined on R^d, given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ R^d Machine learning - statistics: loss for a single pair of observations f_n(θ) = l(y_n, θ⊤Φ(x_n)); f(θ) = E f_n(θ) = E l(y_n, θ⊤Φ(x_n)) = generalization error Expected gradient: f′(θ) = E f′_n(θ) = E{l′(y_n, θ⊤Φ(x_n)) Φ(x_n)} Non-asymptotic results Number of iterations = number of observations Single line of code!
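The "single line of code" is the update θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) itself. A sketch for logistic regression on a simulated stream of observations (the data model, step-size choice, and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
theta_star = rng.standard_normal(d)

def sample_observation():
    """One i.i.d. pair (Phi(x_n), y_n) from a synthetic logistic model."""
    phi = rng.standard_normal(d)
    y = 1 if rng.random() < 1 / (1 + np.exp(-phi @ theta_star)) else -1
    return phi, y

theta = np.zeros(d)
for n in range(1, 10001):
    phi, y = sample_observation()
    gamma = 1.0 / np.sqrt(n)                              # decaying step-size
    grad_n = -y * phi / (1 + np.exp(y * (phi @ theta)))   # f_n'(theta) for the logistic loss
    theta -= gamma * grad_n                               # the single-line SGD update
print("distance to the population optimum:", np.linalg.norm(theta - theta_star))
```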
60 Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations
61 Stochastic approximation Relationship to online learning Minimize f(θ) = E_z l(θ, z) = generalization error of θ, using the gradients of single i.i.d. observations Batch learning: Finite set of observations z₁, ..., z_n Empirical risk: f̂(θ) = (1/n) Σ_{k=1}^n l(θ, z_k) Estimator θ̂ = minimizer of f̂(θ) over a certain class Θ Generalization bound using uniform concentration results
62 Stochastic approximation Relationship to online learning Minimize f(θ) = E_z l(θ, z) = generalization error of θ, using the gradients of single i.i.d. observations Batch learning: Finite set of observations z₁, ..., z_n Empirical risk: f̂(θ) = (1/n) Σ_{k=1}^n l(θ, z_k) Estimator θ̂ = minimizer of f̂(θ) over a certain class Θ Generalization bound using uniform concentration results Online learning: Update θ̂_n after each new (potentially adversarial) observation z_n Cumulative loss: (1/n) Σ_{k=1}^n l(θ̂_{k−1}, z_k) Online to batch through averaging (Cesa-Bianchi et al., 2004)
63 Convex stochastic approximation Key properties of f and/or f_n Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous Strong convexity: f µ-strongly convex
64 Convex stochastic approximation Key properties of f and/or f_n Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro) θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) Polyak-Ruppert averaging: θ̄_n = (1/n) Σ_{k=0}^{n−1} θ_k Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}
65 Convex stochastic approximation Key properties of f and/or f_n Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro) θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) Polyak-Ruppert averaging: θ̄_n = (1/n) Σ_{k=0}^{n−1} θ_k Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α} Desirable practical behavior: Applicable (at least) to classical supervised learning problems Robustness to (potentially unknown) constants (L, B, µ) Adaptivity to the difficulty of the problem (e.g., strong convexity)
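A sketch of this recursion with the classical step-size γ_n = C n^{−α} and Polyak-Ruppert averaging, on a toy strongly convex problem (the problem and names are mine):

```python
import numpy as np

def averaged_sgd(stochastic_grad, d, C=1.0, alpha=0.5, n_iter=10000):
    """Run theta_n = theta_{n-1} - C n^{-alpha} f_n'(theta_{n-1}) and average the iterates."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for n in range(1, n_iter + 1):
        gamma = C * n ** (-alpha)
        theta = theta - gamma * stochastic_grad(theta)
        theta_bar += (theta - theta_bar) / n        # running Polyak-Ruppert average
    return theta, theta_bar

# Toy problem: f(theta) = (1/2)||theta - theta*||^2, gradients observed with additive noise.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])
noisy_grad = lambda th: (th - theta_star) + rng.standard_normal(3)

last, avg = averaged_sgd(noisy_grad, d=3, C=1.0, alpha=0.5)
print("last iterate error:", np.linalg.norm(last - theta_star))
print("averaged error    :", np.linalg.norm(avg - theta_star))
```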
66 Stochastic subgradient descent/method Assumptions: f_n convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D} (f_n) i.i.d. functions such that E f_n = f θ* global optimum of f on {‖θ‖₂ ≤ D} Algorithm: θ_n = Π_D(θ_{n−1} − (2D/(B√n)) f′_n(θ_{n−1})) Bound: E f((1/n) Σ_{k=0}^{n−1} θ_k) − f(θ*) ≤ 2DB/√n Same three-line proof as in the deterministic case Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Running-time complexity: O(dn) after n iterations
67 Stochastic subgradient method - proof - I Iteration: θ_n = Π_D(θ_{n−1} − γ_n f′_n(θ_{n−1})) with γ_n = 2D/(B√n) F_n: information up to time n ‖f′_n(θ)‖₂ ≤ B and ‖θ‖₂ ≤ D, unbiased gradients/functions E(f_n | F_{n−1}) = f
‖θ_n − θ*‖₂² ≤ ‖θ_{n−1} − θ* − γ_n f′_n(θ_{n−1})‖₂² by contractivity of projections
≤ ‖θ_{n−1} − θ*‖₂² + B²γ_n² − 2γ_n (θ_{n−1} − θ*)⊤ f′_n(θ_{n−1}) because ‖f′_n(θ_{n−1})‖₂ ≤ B
E[‖θ_n − θ*‖₂² | F_{n−1}] ≤ ‖θ_{n−1} − θ*‖₂² + B²γ_n² − 2γ_n (θ_{n−1} − θ*)⊤ f′(θ_{n−1})
≤ ‖θ_{n−1} − θ*‖₂² + B²γ_n² − 2γ_n [f(θ_{n−1}) − f(θ*)] (subgradient property)
E‖θ_n − θ*‖₂² ≤ E‖θ_{n−1} − θ*‖₂² + B²γ_n² − 2γ_n [E f(θ_{n−1}) − f(θ*)]
leading to E f(θ_{n−1}) − f(θ*) ≤ B²γ_n/2 + (1/(2γ_n))[E‖θ_{n−1} − θ*‖₂² − E‖θ_n − θ*‖₂²]
68 Stochastic subgradient method - proof - II Starting from E f(θ_{n−1}) − f(θ*) ≤ B²γ_n/2 + (1/(2γ_n))[E‖θ_{n−1} − θ*‖₂² − E‖θ_n − θ*‖₂²], sum over the iterations:
Σ_{u=1}^n [E f(θ_{u−1}) − f(θ*)] ≤ Σ_{u=1}^n B²γ_u/2 + Σ_{u=1}^n (1/(2γ_u))[E‖θ_{u−1} − θ*‖₂² − E‖θ_u − θ*‖₂²]
≤ Σ_{u=1}^n B²γ_u/2 + 4D²/(2γ_n) ≤ 2DB√n with γ_n = 2D/(B√n)
Using convexity: E f((1/n) Σ_{k=0}^{n−1} θ_k) − f(θ*) ≤ 2DB/√n
69 Stochastic subgradient descent - strong convexity - I Assumptions: f_n convex and B-Lipschitz-continuous (f_n) i.i.d. functions such that E f_n = f f µ-strongly convex on {‖θ‖₂ ≤ D} θ* global optimum of f over {‖θ‖₂ ≤ D} Algorithm: θ_n = Π_D(θ_{n−1} − (2/(µ(n+1))) f′_n(θ_{n−1})) Bound: E f((2/(n(n+1))) Σ_{k=1}^n k θ_{k−1}) − f(θ*) ≤ 2B²/(µ(n+1)) Same proof as in the deterministic case (Lacoste-Julien et al., 2012) Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
70 Stochastic subgradient descent - strong convexity - II Assumptions: f_n convex and B-Lipschitz-continuous (f_n) i.i.d. functions such that E f_n = f θ* global optimum of g = f + (µ/2)‖·‖₂² No compactness assumption - no projections Algorithm: θ_n = θ_{n−1} − (2/(µ(n+1))) g′_n(θ_{n−1}) = θ_{n−1} − (2/(µ(n+1))) [f′_n(θ_{n−1}) + µθ_{n−1}] Bound: E g((2/(n(n+1))) Σ_{k=1}^n k θ_{k−1}) − g(θ*) ≤ 2B²/(µ(n+1))
71 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
72 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)
73 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Many contributions in optimization and online learning: Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)
74 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems A single algorithm with a global adaptive convergence rate for smooth problems?
75 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems Non-asymptotic analysis for smooth problems, with convergence rate O(min{1/(µn), 1/√n})?
76 Smoothness/convexity assumptions Iteration: θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) Polyak-Ruppert averaging: θ̄_n = (1/n) Σ_{k=0}^{n−1} θ_k Smoothness of f_n: for each n ≥ 1, the function f_n is a.s. convex, differentiable with L-Lipschitz-continuous gradient f′_n (smooth loss and bounded data) Strong convexity of f: the function f is strongly convex with respect to the norm ‖·‖₂, with convexity constant µ > 0 (invertible population covariance matrix or regularization by (µ/2)‖θ‖²)
77 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ_n = C n^{−α} Strongly convex smooth objective functions: Old: O(n^{−1}) rate achieved without averaging for α = 1 New: O(n^{−1}) rate achieved with averaging for α ∈ [1/2, 1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C
78 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ_n = C n^{−α} Strongly convex smooth objective functions: Old: O(n^{−1}) rate achieved without averaging for α = 1 New: O(n^{−1}) rate achieved with averaging for α ∈ [1/2, 1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C Convergence rates for E‖θ_n − θ*‖² and E‖θ̄_n − θ*‖²: no averaging: O(σ²γ_n/µ) + O(e^{−µnγ_n})‖θ_0 − θ*‖²; averaging: tr H(θ*)^{−1}/n + µ^{−1} O(n^{−2α} + n^{−2+α}) + O(‖θ_0 − θ*‖²/(µ²n²))
79 Classical proof sketch (no averaging)
‖θ_n − θ*‖₂² = ‖θ_{n−1} − γ_n f′_n(θ_{n−1}) − θ*‖₂²
= ‖θ_{n−1} − θ*‖₂² − 2γ_n (θ_{n−1} − θ*)⊤ f′_n(θ_{n−1}) + γ_n² ‖f′_n(θ_{n−1})‖₂²
≤ ‖θ_{n−1} − θ*‖₂² − 2γ_n (θ_{n−1} − θ*)⊤ f′_n(θ_{n−1}) + 2γ_n² ‖f′_n(θ*)‖₂² + 2γ_n² ‖f′_n(θ_{n−1}) − f′_n(θ*)‖₂²
≤ ‖θ_{n−1} − θ*‖₂² − 2γ_n (θ_{n−1} − θ*)⊤ f′_n(θ_{n−1}) + 2γ_n² ‖f′_n(θ*)‖₂² + 2γ_n² L [f′_n(θ_{n−1}) − f′_n(θ*)]⊤(θ_{n−1} − θ*)
Taking conditional expectations, E[‖θ_n − θ*‖₂² | F_{n−1}]
≤ ‖θ_{n−1} − θ*‖₂² − 2γ_n (θ_{n−1} − θ*)⊤ f′(θ_{n−1}) + 2γ_n² E‖f′_n(θ*)‖₂² + 2γ_n² L [f′(θ_{n−1}) − 0]⊤(θ_{n−1} − θ*)
= ‖θ_{n−1} − θ*‖₂² − 2γ_n(1 − γ_n L)(θ_{n−1} − θ*)⊤ f′(θ_{n−1}) + 2γ_n²σ²
≤ ‖θ_{n−1} − θ*‖₂² − 2γ_n(1 − γ_n L)(µ/2)‖θ_{n−1} − θ*‖₂² + 2γ_n²σ²
= [1 − µγ_n(1 − γ_n L)] ‖θ_{n−1} − θ*‖₂² + 2γ_n²σ²,
hence E[‖θ_n − θ*‖₂²] ≤ [1 − µγ_n(1 − γ_n L)] E[‖θ_{n−1} − θ*‖₂²] + 2γ_n²σ²
80 Proof sketch (averaging) From Polyak and Juditsky (1992): θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}), so f′_n(θ_{n−1}) = (1/γ_n)(θ_{n−1} − θ_n)
f′_n(θ*) + f″_n(θ*)(θ_{n−1} − θ*) = (1/γ_n)(θ_{n−1} − θ_n) + O(‖θ_{n−1} − θ*‖²)
f′_n(θ*) + f″(θ*)(θ_{n−1} − θ*) = (1/γ_n)(θ_{n−1} − θ_n) + O(‖θ_{n−1} − θ*‖²) + O(‖θ_{n−1} − θ*‖) ε_n
hence θ_{n−1} − θ* = −f″(θ*)^{−1} f′_n(θ*) + (1/γ_n) f″(θ*)^{−1}(θ_{n−1} − θ_n) + O(‖θ_{n−1} − θ*‖²) + O(‖θ_{n−1} − θ*‖) ε_n
Averaging to cancel the term (1/γ_n) f″(θ*)^{−1}(θ_{n−1} − θ_n)
81 Robustness to wrong constants for γ_n = C n^{−α} f(θ) = (1/2)θ² with i.i.d. Gaussian noise (d = 1) Left: α = 1/2; Right: α = 1 (Figure: log[f(θ_n) − f*] vs. log(n) for SGD and averaged SGD with C ∈ {1/5, 1, 5}, for α = 1/2 and α = 1.) See also
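The experiment behind this figure is easy to reproduce. Below is a sketch of one possible version (f(θ) = θ²/2 in dimension 1, gradient observed with additive standard Gaussian noise; the exact constants are my guesses, not the authors' script):

```python
import numpy as np

def run(C, alpha, n_iter=10000, seed=0):
    """SGD on f(theta) = theta^2/2 with noisy gradient theta + N(0,1); returns f at last and averaged iterate."""
    rng = np.random.default_rng(seed)
    theta, theta_bar = 1.0, 0.0
    for n in range(1, n_iter + 1):
        gamma = C * n ** (-alpha)
        theta -= gamma * (theta + rng.standard_normal())   # unbiased stochastic gradient
        theta_bar += (theta - theta_bar) / n               # Polyak-Ruppert average
    return 0.5 * theta ** 2, 0.5 * theta_bar ** 2

for alpha in (0.5, 1.0):
    for C in (0.2, 1.0, 5.0):
        f_sgd, f_ave = run(C, alpha)
        print(f"alpha={alpha}, C={C}: f(theta_n)={f_sgd:.2e}, f(averaged)={f_ave:.2e}")
```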
82 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ_n = C n^{−α} Strongly convex smooth objective functions: Old: O(n^{−1}) rate achieved without averaging for α = 1 New: O(n^{−1}) rate achieved with averaging for α ∈ [1/2, 1] Non-asymptotic analysis with explicit constants
83 Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ_n = C n^{−α} Strongly convex smooth objective functions: Old: O(n^{−1}) rate achieved without averaging for α = 1 New: O(n^{−1}) rate achieved with averaging for α ∈ [1/2, 1] Non-asymptotic analysis with explicit constants Non-strongly convex smooth objective functions: Old: O(n^{−1/2}) rate achieved with averaging for α = 1/2 New: O(max{n^{1/2−3α/2}, n^{−α/2}, n^{α−1}}) rate achieved without averaging for α ∈ [1/3, 1] (worse than with averaging) Take-home message: use α = 1/2 with averaging to be adaptive to strong convexity
84 Beyond stochastic gradient method Adding a proximal step Goal: min_{θ ∈ R^d} f(θ) + Ω(θ) = E f_n(θ) + Ω(θ) Replace the recursion θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1}) by θ_n = argmin_{θ ∈ R^d} ‖θ − θ_{n−1} + γ_n f′_n(θ_{n−1})‖₂² + C Ω(θ) Xiao (2010); Hu et al. (2009) May be accelerated (Ghadimi and Lan, 2013) Related frameworks: Regularized dual averaging (Nesterov, 2009; Xiao, 2010) Mirror descent (Nemirovski et al., 2009; Lan et al., 2012)
85 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
86 Convex stochastic approximation Existing work: Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1} Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2} Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems A single adaptive algorithm for smooth problems with convergence rate O(min{1/(µn), 1/√n}) in all situations?
87 Adaptive algorithm for logistic regression Logistic regression: (Φ(x_n), y_n) ∈ R^d × {−1, 1} Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n))) Generalization error: f(θ) = E f_n(θ)
88 Adaptive algorithm for logistic regression Logistic regression: (Φ(x_n), y_n) ∈ R^d × {−1, 1} Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n))) Generalization error: f(θ) = E f_n(θ) Cannot be strongly convex ⇒ local strong convexity, unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^{−M}) µ = lowest eigenvalue of the Hessian at the optimum, f″(θ*) (Figure: the logistic loss)
89 Adaptive algorithm for logistic regression Logistic regression: (Φ(x_n), y_n) ∈ R^d × {−1, 1} Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n))) Generalization error: f(θ) = E f_n(θ) Cannot be strongly convex ⇒ local strong convexity, unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^{−M}) µ = lowest eigenvalue of the Hessian at the optimum, f″(θ*) n steps of averaged SGD with constant step-size 1/(2R²√n), with R = radius of the data (Bach, 2013): E f(θ̄_n) − f(θ*) ≤ min{1/√n, R²/(nµ)} (15 + 5R‖θ_0 − θ*‖)⁴ Proof based on self-concordance (Nesterov and Nemirovski, 1994)
90 Self-concordance Usual definition for convex φ : R → R: |φ‴(t)| ≤ 2 φ″(t)^{3/2} Affine invariant Extendable to all convex functions on R^d by looking at rays Used for the sharp proof of quadratic convergence of Newton method (Nesterov and Nemirovski, 1994) Generalized notion: |φ‴(t)| ≤ φ″(t) Applicable to logistic regression (with extensions)
91 Self-concordance Usual definition for convex φ : R → R: |φ‴(t)| ≤ 2 φ″(t)^{3/2} Affine invariant Extendable to all convex functions on R^d by looking at rays Used for the sharp proof of quadratic convergence of Newton method (Nesterov and Nemirovski, 1994) Generalized notion: |φ‴(t)| ≤ φ″(t) Applicable to logistic regression (with extensions) Important properties: Allows global Taylor expansions Relates expansions of derivatives of different orders
92 Adaptive algorithm for logistic regression Proof sketch Step 1: use the existing result f(θ̄_n) ≤ f(θ*) + (R²/√n)‖θ_0 − θ*‖₂² = f(θ*) + O(1/√n) Step 2: f′_n(θ_{n−1}) = (1/γ)(θ_{n−1} − θ_n), hence (1/n) Σ_{k=1}^n f′_k(θ_{k−1}) = (1/(nγ))(θ_0 − θ_n) Step 3: ‖f′((1/n) Σ_{k=1}^n θ_{k−1}) − (1/n) Σ_{k=1}^n f′(θ_{k−1})‖₂ = O(f(θ̄_n) − f(θ*)) = O(1/√n) using self-concordance Step 4a: if f is µ-strongly convex, f(θ̄_n) − f(θ*) ≤ (1/(2µ))‖f′(θ̄_n)‖₂² Step 4b: if f is self-concordant, this holds locally with µ = λ_min(f″(θ*))
94 Adaptive algorithm for logistic regression Logistic regression: (Φ(x_n), y_n) ∈ R^d × {−1, 1} Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n))) Generalization error: f(θ) = E f_n(θ) Cannot be strongly convex ⇒ local strong convexity, unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^{−M}) µ = lowest eigenvalue of the Hessian at the optimum, f″(θ*) n steps of averaged SGD with constant step-size 1/(2R²√n), with R = radius of the data (Bach, 2013): E f(θ̄_n) − f(θ*) ≤ min{1/√n, R²/(nµ)} (15 + 5R‖θ_0 − θ*‖)⁴ A single adaptive algorithm for smooth problems with convergence rate O(1/n) in all situations?
95 Least-mean-square algorithm Least-squares: f(θ) = (1/2) E[(y_n − ⟨Φ(x_n), θ⟩)²] with θ ∈ R^d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) Usually studied without averaging and with decreasing step-sizes, under the strong convexity assumption E[Φ(x_n) ⊗ Φ(x_n)] = H ≽ µ·Id
96 Least-mean-square algorithm Least-squares: f(θ) = (1/2) E[(y_n − ⟨Φ(x_n), θ⟩)²] with θ ∈ R^d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) Usually studied without averaging and with decreasing step-sizes, under the strong convexity assumption E[Φ(x_n) ⊗ Φ(x_n)] = H ≽ µ·Id New analysis for averaging and constant step-size γ = 1/(4R²) Assume ‖Φ(x_n)‖ ≤ R and |y_n − ⟨Φ(x_n), θ*⟩| ≤ σ almost surely No assumption regarding the lowest eigenvalues of H Main result: E f(θ̄_{n−1}) − f(θ*) ≤ 4σ²d/n + 4R²‖θ_0 − θ*‖²/n Matches the statistical lower bound (Tsybakov, 2003) Non-asymptotic robust version of Györfi and Walk (1996)
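A sketch of this averaged, constant-step-size least-mean-square recursion with γ = 1/(4R²) (the synthetic data and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 100000
theta_star = rng.standard_normal(d)

# Stream of observations with bounded features and bounded optimal residuals.
Phi = rng.uniform(-1.0, 1.0, size=(n, d))
R = np.max(np.linalg.norm(Phi, axis=1))
y = Phi @ theta_star + 0.1 * rng.uniform(-1.0, 1.0, size=n)

gamma = 1.0 / (4.0 * R ** 2)        # constant step-size from the analysis above
theta = np.zeros(d)
theta_bar = np.zeros(d)
for k in range(n):
    # LMS update: theta_n = theta_{n-1} - gamma (<Phi(x_n), theta_{n-1}> - y_n) Phi(x_n)
    theta -= gamma * (Phi[k] @ theta - y[k]) * Phi[k]
    theta_bar += (theta - theta_bar) / (k + 1)   # Polyak-Ruppert averaging

print("error of last iterate    :", np.linalg.norm(theta - theta_star))
print("error of averaged iterate:", np.linalg.norm(theta_bar - theta_star))
```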
97 Least-squares - Proof technique LMS recursion: θ_n − θ* = [I − γ Φ(x_n) ⊗ Φ(x_n)](θ_{n−1} − θ*) + γ ε_n Φ(x_n) Simplified LMS recursion: with H = E[Φ(x_n) ⊗ Φ(x_n)], θ_n − θ* = [I − γH](θ_{n−1} − θ*) + γ ε_n Φ(x_n) Direct proof technique of Polyak and Juditsky (1992), e.g., θ_n − θ* = [I − γH]^n (θ_0 − θ*) + γ Σ_{k=1}^n [I − γH]^{n−k} ε_k Φ(x_k) Infinite expansion of Aguech, Moulines, and Priouret (2000) in powers of γ
98 Markov chain interpretation of constant step sizes LMS recursion for f_n(θ) = (1/2)(y_n − ⟨Φ(x_n), θ⟩)²: θ_n = θ_{n−1} − γ(⟨Φ(x_n), θ_{n−1}⟩ − y_n) Φ(x_n) The sequence (θ_n)_n is a homogeneous Markov chain: convergence to a stationary distribution π_γ with expectation θ̄_γ = ∫ θ π_γ(dθ) For least-squares, θ̄_γ = θ* (Figure: iterates θ_n oscillating around θ*, starting from θ_0.)
101 Markov chain interpretation of constant step sizes LMS recursion for f_n(θ) = (1/2)(y_n − ⟨Φ(x_n), θ⟩)²: θ_n = θ_{n−1} − γ(⟨Φ(x_n), θ_{n−1}⟩ − y_n) Φ(x_n) The sequence (θ_n)_n is a homogeneous Markov chain: convergence to a stationary distribution π_γ with expectation θ̄_γ = ∫ θ π_γ(dθ) For least-squares, θ̄_γ = θ*: θ_n does not converge to θ* but oscillates around it, with oscillations of order √γ Ergodic theorem: averaged iterates converge to θ̄_γ = θ* at rate O(1/n)
102 Simulations - synthetic examples Gaussian distributions - p = 20 (Figure: log₁₀[f(θ) − f(θ*)] vs. log₁₀(n) for the square loss on synthetic data, step-sizes 1/(2R²), 1/(8R²), 1/(32R²) and 1/(2R²√n).)
103 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) (Figure: log₁₀[f(θ) − f(θ*)] on the test set vs. log₁₀(n) for the square loss on the alpha and news data sets, comparing step-sizes 1/R² and 1/(R²√n) (resp. C/R² and C/(R²√n) for C = opt) with SAG.)
104 Beyond least-squares - Markov chain interpretation Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0 When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0
105 Beyond least-squares - Markov chain interpretation Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0 When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0 θ_n oscillates around the wrong value θ̄_γ ≠ θ* (Figure: iterates oscillating around θ̄_γ rather than θ*.)
106 Beyond least-squares - Markov chain interpretation Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0 When f′ is not linear, f′(∫ θ π_γ(dθ)) ≠ ∫ f′(θ) π_γ(dθ) = 0 θ_n oscillates around the wrong value θ̄_γ ≠ θ*; moreover, θ* − θ_n = O_p(√γ) Linear convergence up to the noise level for strongly-convex problems (Nedic and Bertsekas, 2000) Ergodic theorem: averaged iterates converge to θ̄_γ ≠ θ* at rate O(1/n); moreover, θ* − θ̄_γ = O(γ) (Bach, 2013)
107 Simulations - synthetic examples Gaussian distributions - p = 20 (Figure: log₁₀[f(θ) − f(θ*)] vs. log₁₀(n) for the logistic loss on synthetic data, step-sizes 1/(2R²), 1/(8R²), 1/(32R²) and 1/(2R²√n).)
108 Restoring convergence through online Newton steps Known facts: 1. Averaged SGD with γ_n ∝ n^{−1/2} leads to the robust rate O(n^{−1/2}) for all convex functions 2. Averaged SGD with γ_n constant leads to the robust rate O(n^{−1}) for all convex quadratic functions 3. Newton's method squares the error at each iteration for smooth functions 4. A single step of Newton's method is equivalent to minimizing the quadratic Taylor expansion Online Newton step: Rate: O((n^{−1/2})² + n^{−1}) = O(n^{−1}) Complexity: O(p) per iteration
109 Restoring convergence through online Newton steps Known facts: 1. Averaged SGD with γ_n ∝ n^{−1/2} leads to the robust rate O(n^{−1/2}) for all convex functions 2. Averaged SGD with γ_n constant leads to the robust rate O(n^{−1}) for all convex quadratic functions ⇒ O(n^{−1}) 3. Newton's method squares the error at each iteration for smooth functions ⇒ O((n^{−1/2})²) 4. A single step of Newton's method is equivalent to minimizing the quadratic Taylor expansion Online Newton step: Rate: O((n^{−1/2})² + n^{−1}) = O(n^{−1}) Complexity: O(p) per iteration
110 Restoring convergence through online Newton steps The Newton step for f = E f_n(θ) = E[l(y_n, ⟨θ, Φ(x_n)⟩)] at θ̃ is equivalent to minimizing the quadratic approximation
g(θ) = f(θ̃) + ⟨f′(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f″(θ̃)(θ − θ̃)⟩
= f(θ̃) + ⟨E f′_n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, E f″_n(θ̃)(θ − θ̃)⟩
= E[f(θ̃) + ⟨f′_n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f″_n(θ̃)(θ − θ̃)⟩]
111 Restoring convergence through online Newton steps The Newton step for f = E f_n(θ) = E[l(y_n, ⟨θ, Φ(x_n)⟩)] at θ̃ is equivalent to minimizing the quadratic approximation
g(θ) = f(θ̃) + ⟨f′(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f″(θ̃)(θ − θ̃)⟩
= f(θ̃) + ⟨E f′_n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, E f″_n(θ̃)(θ − θ̃)⟩
= E[f(θ̃) + ⟨f′_n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f″_n(θ̃)(θ − θ̃)⟩]
Complexity of the least-mean-square recursion for g is O(p): θ_n = θ_{n−1} − γ[f′_n(θ̃) + f″_n(θ̃)(θ_{n−1} − θ̃)] f″_n(θ̃) = l″(y_n, ⟨θ̃, Φ(x_n)⟩) Φ(x_n) ⊗ Φ(x_n) has rank one New online Newton step without computing/inverting Hessians
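To make the rank-one structure concrete, here is a sketch of this recursion for logistic regression (my own construction following the formula above; the function names, constants, and usage example are assumptions, not the slides'): around a support point θ̃, each update only needs the first and second derivatives of the loss at ⟨θ̃, Φ(x_n)⟩, so the cost stays O(p) per iteration.

```python
import numpy as np

def online_newton_lms(Phi, y, theta_tilde, gamma):
    """Averaged LMS recursion on the quadratic approximation of the logistic risk around theta_tilde:
    theta_n = theta_{n-1} - gamma [ f_n'(theta_tilde) + f_n''(theta_tilde)(theta_{n-1} - theta_tilde) ],
    exploiting the rank-one structure of f_n''(theta_tilde)."""
    theta = theta_tilde.copy()                     # start the recursion at the support point
    theta_bar = np.zeros_like(theta)
    for k in range(Phi.shape[0]):
        phi, yk = Phi[k], y[k]
        u = yk * (phi @ theta_tilde)
        sigma = 1.0 / (1.0 + np.exp(u))
        grad_tilde = -yk * sigma * phi                                       # f_n'(theta_tilde)
        curvature = sigma * (1.0 - sigma)                                    # l''(<theta_tilde, Phi(x_n)>)
        hess_times_diff = curvature * (phi @ (theta - theta_tilde)) * phi    # rank-one Hessian-vector product
        theta = theta - gamma * (grad_tilde + hess_times_diff)
        theta_bar += (theta - theta_bar) / (k + 1)                           # averaged iterate
    return theta_bar

# Usage example with a synthetic logistic model; theta_tilde would normally come from a
# first pass of averaged SGD (here it is simply set to zero for illustration).
rng = np.random.default_rng(0)
p, n = 5, 20000
theta_star = rng.standard_normal(p)
Phi = rng.standard_normal((n, p))
y = np.where(rng.random(n) < 1 / (1 + np.exp(-Phi @ theta_star)), 1.0, -1.0)
print("averaged iterate:", online_newton_lms(Phi, y, np.zeros(p), gamma=0.1))
```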
112 Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity
113 Logistic regression - Proof technique Using generalized self-concordance of φ : u ↦ log(1 + e^{−u}): |φ‴(u)| ≤ φ″(u) NB: difference with regular self-concordance: |φ‴(u)| ≤ 2φ″(u)^{3/2} Using novel high-probability convergence results for regular averaged stochastic gradient descent Requires an assumption on the kurtosis in every direction, i.e., E⟨Φ(x_n), η⟩⁴ ≤ κ [E⟨Φ(x_n), η⟩²]²
114 Choice of support point for online Newton step Two-stage procedure: (1) Run n/2 iterations of averaged SGD to obtain θ̃ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity Update at each iteration using the current averaged iterate Recursion: θ_n = θ_{n−1} − γ[f′_n(θ̄_{n−1}) + f″_n(θ̄_{n−1})(θ_{n−1} − θ̄_{n−1})] No provable convergence rate (yet) but best practical behavior Note the (dis)similarity with regular SGD: θ_n = θ_{n−1} − γ f′_n(θ_{n−1})
115 Simulations - synthetic examples Gaussian distributions - p = 20 (Figure: log₁₀[f(θ) − f(θ*)] vs. log₁₀(n) for the logistic loss on synthetic data, comparing step-sizes 1/(2R²), 1/(8R²), 1/(32R²), 1/(2R²√n) and the online Newton variants: support point updated every 2^p iterations, every iteration, 2-step, and 2-step doubling.)
116 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) (Figure: log₁₀[f(θ) − f(θ*)] on the test set vs. log₁₀(n) for the logistic loss on the alpha and news data sets, comparing step-sizes 1/R² and 1/(R²√n) (resp. C/R² and C/(R²√n) for C = opt) with SAG, Adagrad and Newton.)
117 Optimal bounds for least-squares? Least-squares: cannot beat σ²d/n (Tsybakov, 2003). Really? Refined assumptions with adaptivity (Dieuleveut and Bach, 2014):
f(θ̄_n) − f(θ*) ≤ 16σ² tr Σ^{1/α} (γn)^{1/α}/n + 4‖H^{1/2−r}(θ_0 − θ*)‖²/(γ^{2r} n^{2 min{r,1}})
Extension to non-parametric estimation Previous results: α = 0 and r = 1/2 Asymptotic analysis (Défossez and Bach, 2015): bias vs. variance, which one is dominating? Acceleration (Flammarion and Bach, 2015)
118 Bias-variance decomposition (Défossez and Bach, 2015) Simplification: dominating term when n → +∞ and γ → 0 Variance (e.g., starting from the solution): f(θ̄_n) − f(θ*) ≅ (1/n) E[ε² Φ(x)⊤ H^{−1} Φ(x)] NB: if the noise ε is independent, then we obtain dσ²/n Exponentially decaying remainder terms (strongly convex problems) Bias (e.g., no noise): f(θ̄_n) − f(θ*) ≅ (1/(n²γ²)) (θ_0 − θ*)⊤ H^{−1} (θ_0 − θ*)
119 Bias-variance decomposition (Figure: f(θ̄_n) − f(θ*) vs. iteration n, showing the bias and variance terms for γ = γ_0 and γ = γ_0/10, with a reference slope of −2.)
120 Acceleration (Flammarion and Bach, 2015) Existing results (Bach and Moulines, 2013): Variance = σ²d/n Bias ≤ min{R²‖θ_0 − θ*‖²/n, R⁴⟨θ_0 − θ*, H^{−1}(θ_0 − θ*)⟩/n²} Is it possible to get a bias term in R²‖θ_0 − θ*‖²/n²? Corresponds to acceleration (Nesterov, 1983) Best (current) result: σ²d/n^{1−α} + R²‖θ_0 − θ*‖²/n^{1+α}
121 Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results Beyond decaying step-sizes 4. Finite data sets
122 Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y,θ Φ(x))
123 Going beyond a single pass over the data Stochastic approximation: Assumes an infinite data stream Observations are used only once Directly minimizes the testing cost E_{(x,y)} l(y, θ⊤Φ(x)) Machine learning practice: Finite data set (x₁, y₁, ..., x_n, y_n) Multiple passes Minimizes the training cost (1/n) Σ_{i=1}^n l(y_i, θ⊤Φ(x_i)) Need to regularize (e.g., by the l₂-norm) to avoid overfitting Goal: minimize g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
124 Stochastic vs. deterministic methods Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = l(y_i, θ⊤Φ(x_i)) + µΩ(θ) Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f′_i(θ_{t−1}) Linear (e.g., exponential) convergence rate in O(e^{−αt}) Iteration complexity is linear in n (with line search)
126 Stochastic vs. deterministic methods Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = l(y_i, θ⊤Φ(x_i)) + µΩ(θ) Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f′_i(θ_{t−1}) Linear (e.g., exponential) convergence rate in O(e^{−αt}); iteration complexity is linear in n (with line search) Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1}) Sampling with replacement: i(t) random element of {1,...,n} Convergence rate in O(1/t); iteration complexity is independent of n (step size selection?)
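A sketch contrasting the two recursions on a small regularized logistic-regression problem (the data, constants, and function names are mine, not the slides'); each batch step touches all n examples, while each stochastic step touches a single example sampled with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 1000, 20, 0.1
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))

def grad_i(theta, i):
    """Gradient of f_i(theta) = logistic loss on example i + (mu/2)||theta||_2^2."""
    s = 1.0 / (1.0 + np.exp(y[i] * (X[i] @ theta)))
    return -y[i] * s * X[i] + mu * theta

def g(theta):
    """Full regularized empirical risk."""
    return np.mean(np.logaddexp(0.0, -y * (X @ theta))) + 0.5 * mu * theta @ theta

L = 0.25 * np.linalg.eigvalsh(X.T @ X / n).max() + mu    # smoothness constant of g

# Batch gradient descent: every iteration touches all n examples.
theta_batch = np.zeros(d)
for t in range(50):
    theta_batch -= np.mean([grad_i(theta_batch, i) for i in range(n)], axis=0) / L

# Stochastic gradient descent: one example per iteration, sampled with replacement.
theta_sgd = np.zeros(d)
for t in range(1, 50 * n + 1):
    i = rng.integers(n)
    theta_sgd -= grad_i(theta_sgd, i) / (mu * (t + 1))   # step size ~ 1/(mu t) for the strongly convex sum

print("batch objective     :", g(theta_batch))
print("stochastic objective:", g(theta_sgd))
```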
128 Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(1) iteration cost, robustness to step size (Figure: log(excess cost) vs. time for a stochastic and a deterministic method.)
129 Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(1) iteration cost, robustness to step size (Figure: log(excess cost) vs. time for a stochastic, a deterministic, and a hybrid method.)
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationStochastic Gradient Descent. Ryan Tibshirani Convex Optimization
Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationErgodic Subgradient Descent
Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationOn Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression
On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem
More informationStochastic Composition Optimization
Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationConvergence Rates of Kernel Quadrature Rules
Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction
More informationEstimators based on non-convex programs: Statistical and computational guarantees
Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationOptimistic Rates Nati Srebro
Optimistic Rates Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationBits of Machine Learning Part 1: Supervised Learning
Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationONLINE VARIANCE-REDUCING OPTIMIZATION
ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com
More informationStatistical Sparse Online Regression: A Diffusion Approximation Perspective
Statistical Sparse Online Regression: A Diffusion Approximation Perspective Jianqing Fan Wenyan Gong Chris Junchi Li Qiang Sun Princeton University Princeton University Princeton University University
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationOptimal Regularized Dual Averaging Methods for Stochastic Optimization
Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business
More informationStochastic Optimization: First order method
Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationWhy should you care about the solution strategies?
Optimization Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms makes you more effectively choose which algorithm to run Understanding the
More informationRandomized Smoothing for Stochastic Optimization
Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and
More informationConstrained Stochastic Gradient Descent for Large-scale Least Squares Problem
Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem ABSRAC Yang Mu University of Massachusetts Boston 100 Morrissey Boulevard Boston, MA, US 015 yangmu@cs.umb.edu ianyi Zhou University
More informationNon-asymptotic bound for stochastic averaging
Non-asymptotic bound for stochastic averaging S. Gadat and F. Panloup Toulouse School of Economics PGMO Days, November, 14, 2017 I - Introduction I - 1 Motivations I - 2 Optimization I - 3 Stochastic Optimization
More informationTutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer
Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,
More informationSummary and discussion of: Dropout Training as Adaptive Regularization
Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationModern Optimization Techniques
Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationAccelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization
Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationOptimized first-order minimization methods
Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationMore Optimization. Optimization Methods. Methods
More More Optimization Optimization Methods Methods Yann YannLeCun LeCun Courant CourantInstitute Institute http://yann.lecun.com http://yann.lecun.com (almost) (almost) everything everything you've you've
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationAccelerate Subgradient Methods
Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationSimultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms
More informationMinimizing finite sums with the stochastic average gradient
Math. Program., Ser. A (2017) 162:83 112 DOI 10.1007/s10107-016-1030-6 FULL LENGTH PAPER Minimizing finite sums with the stochastic average gradient Mark Schmidt 1 Nicolas Le Roux 2 Francis Bach 3 Received:
More informationMirror Descent for Metric Learning. Gautam Kunapuli Jude W. Shavlik
Mirror Descent for Metric Learning Gautam Kunapuli Jude W. Shavlik And what do we have here? We have a metric learning algorithm that uses composite mirror descent (COMID): Unifying framework for metric
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More information