Statistical Optimality of Stochastic Gradient Descent through Multiple Passes


Francis Bach, INRIA - École Normale Supérieure, Paris, France
Joint work with Loucas Pillaud-Vivien and Alessandro Rudi
Newton Institute, Cambridge - June 2018

Two-minute summary
- Stochastic gradient descent (SGD) for large-scale machine learning: processes observations one by one.
- Theory: single-pass SGD is optimal, but only for easy problems.
- Practice: multiple-pass SGD always works better.
- Provable for hard problems, with a quantification of the required number of passes and optimal statistical performance.
- Source and capacity conditions from kernel methods.

Least-squares regression in finite dimension
- Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$, $i = 1, \dots, n$, i.i.d.
- Prediction as linear functions $\langle \theta, \Phi(x)\rangle$ of features $\Phi(x) \in \mathcal{H} = \mathbb{R}^d$.
- Optimal prediction: $\theta_* \in \mathcal{H}$ minimizing $F(\theta) = \mathbb{E}\,(y - \langle \theta, \Phi(x)\rangle)^2$.
- Assumption: $\|\Phi(x)\| \leq R$ almost surely.
- Assumption: $|y| \leq M$ and $|y - \langle \theta_*, \Phi(x)\rangle| \leq \sigma$ almost surely.
- Statistical performance of an estimator $\hat\theta$ measured by the excess risk $\mathbb{E} F(\hat\theta) - F(\theta_*)$.
- Finite dimension: optimal rate $\sigma^2 \dim(\mathcal{H}) / n = \sigma^2 d / n$, attained by empirical risk minimization (ERM) and by SGD (see the sketch below).
- What if $n \ll \dim(\mathcal{H})$? Needs assumptions on $\Sigma = \mathbb{E}\big[\Phi(x) \otimes \Phi(x)\big]$ and $\theta_*$.
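
To make the $\sigma^2 d / n$ rate concrete, here is a minimal synthetic check of the excess risk of ERM; the isotropic Gaussian features, the choice of $\theta_*$ and all constants are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 0.5
theta_star = rng.standard_normal(d) / np.sqrt(d)

def excess_risk_erm(n):
    """Excess risk F(theta_hat) - F(theta_*) of ordinary least squares
    on n samples with isotropic Gaussian features (so Sigma = I)."""
    X = rng.standard_normal((n, d))
    y = X @ theta_star + sigma * rng.standard_normal(n)
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # For Sigma = I, F(theta) - F(theta_*) = ||theta - theta_*||^2
    return np.sum((theta_hat - theta_star) ** 2)

for n in [100, 1000, 10000]:
    err = np.mean([excess_risk_erm(n) for _ in range(50)])
    print(f"n={n:6d}  excess risk ~ {err:.2e}   sigma^2 d / n = {sigma**2 * d / n:.2e}")
```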

Spectrum of the covariance matrix $\Sigma = \mathbb{E}\big[\Phi(x) \otimes \Phi(x)\big]$
- Eigenvalues $\lambda_m(\Sigma)$ (in decreasing order).
- Example: news dataset. [Figure: $\log \lambda_m$ versus $\log m$, showing a roughly linear decay on the log-log scale.]
- Assumption: $\mathrm{tr}(\Sigma^{1/\alpha}) = \sum_{m \geq 1} \lambda_m(\Sigma)^{1/\alpha}$ is small (compared to $n$); see the diagnostic sketch below.
- Equivalent to $\lambda_m(\Sigma) = O(m^{-\alpha})$.
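
A small diagnostic one can run to check this capacity condition empirically: estimate the decay exponent from a log-log fit of the empirical eigenvalues and evaluate $\mathrm{tr}(\Sigma^{1/\alpha})$ over a grid. The helper name `capacity_diagnostics` and the synthetic features with a polynomial spectrum are assumptions for the demo.

```python
import numpy as np

def capacity_diagnostics(X, alpha_grid=np.linspace(1.0, 5.0, 41)):
    """Eigenvalue decay of the empirical covariance and tr(Sigma^{1/alpha}).

    X: (n, d) feature matrix.  Returns the log-log slope of the eigenvalues
    (roughly -alpha if lambda_m ~ m^{-alpha}) and tr(Sigma^{1/alpha}) on a grid.
    """
    Sigma = X.T @ X / X.shape[0]
    lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
    lam = lam[lam > 1e-12]                             # drop numerically zero eigenvalues
    m = np.arange(1, lam.size + 1)
    slope = np.polyfit(np.log(m), np.log(lam), 1)[0]   # ~ -alpha
    traces = {a: np.sum(lam ** (1.0 / a)) for a in alpha_grid}
    return slope, traces

# Illustrative use on synthetic features whose covariance has eigenvalues m^{-alpha}.
rng = np.random.default_rng(0)
d, n, alpha_true = 200, 5000, 2.0
scales = np.arange(1, d + 1) ** (-alpha_true / 2)
X = rng.standard_normal((n, d)) * scales
slope, traces = capacity_diagnostics(X)
print("estimated decay exponent:", -slope)
```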

Difficulty of the learning problem
- Measured through the norm of $\theta_*$.
- Assumption: $\|\Sigma^{1/2 - r}\theta_*\|$ is small (compared to $n$); see the toy model sketched below.
- $r = 1/2$: the usual assumption that $\|\theta_*\|$ is small.
- Larger $r$: simpler problems; smaller $r$: harder problems ($r = 0$ is always true).
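
A diagonal toy model illustrating how the source condition selects admissible values of $r$; the spectrum exponent $\alpha = 2$ and the coefficient decay $\delta = 0.25$ are illustrative assumptions.

```python
import numpy as np

# Diagonal toy model: Sigma = diag(lam_m) with lam_m = m^{-alpha}, and
# theta_* with coefficients m^{-delta}.  Then
#   ||Sigma^{1/2-r} theta_*||^2 = sum_m lam_m^{1-2r} theta_m^2,
# which is finite only for r below a threshold set by alpha and delta:
# this is how r quantifies the difficulty of the problem.
alpha, delta = 2.0, 0.25

def source_norm2(r, M):
    m = np.arange(1, M + 1)
    return np.sum(m ** (-alpha * (1 - 2 * r)) * m ** (-2 * delta))

for r in [0.0, 0.25, 0.375, 0.5]:
    a, b = source_norm2(r, 10**5), source_norm2(r, 10**6)
    print(f"r = {r:5.3f}   truncated norm^2: {a:10.3f} (M=1e5)   {b:10.3f} (M=1e6)")
# If the two truncations agree, the norm is essentially finite; if the value
# keeps growing with M, that value of r is too large for this theta_*.
```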

Optimal statistical performance
- Easy problems, $r \geq \frac{\alpha - 1}{2\alpha}$: the optimal rate is $O(n^{-2r\alpha/(2r\alpha+1)})$, achieved by:
  - regularized ERM (Caponnetto and De Vito, 2007),
  - early-stopped gradient descent (Yao et al., 2007),
  - single-pass averaged SGD (Dieuleveut and Bach, 2016).
- Hard problems, $r \leq \frac{\alpha - 1}{2\alpha}$: lower bound $O(n^{-2r\alpha/(2r\alpha+1)})$; known upper bound $O(n^{-2r})$.

Least-mean-squares (LMS) algorithm
- Least-squares: $F(\theta) = \tfrac{1}{2}\,\mathbb{E}\big[(y - \langle \Phi(x), \theta\rangle)^2\big]$ with $\theta \in \mathbb{R}^d$.
- SGD = least-mean-squares algorithm (see, e.g., Macchi, 1995).
- Iteration: $\theta_i = \theta_{i-1} - \gamma\,\big(\langle \Phi(x_i), \theta_{i-1}\rangle - y_i\big)\,\Phi(x_i)$ (see the sketch below).
- New analysis for averaging and constant step size $\gamma = 1/(4R^2)$ (Bach and Moulines, 2013).
- Assume $\|\Phi(x)\| \leq R$ and $|y - \langle \Phi(x), \theta_*\rangle| \leq \sigma$ almost surely; no assumption on the lowest eigenvalues of $\Sigma$.
- Main result: $\mathbb{E} F(\bar\theta_n) - F(\theta_*) \leq \dfrac{4\sigma^2 d}{n} + \dfrac{4 R^2 \|\theta_0 - \theta_*\|^2}{n}$.
- Matches the statistical lower bound (Tsybakov, 2003).
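
A minimal sketch of the averaged constant-step-size LMS recursion with $\gamma = 1/(4R^2)$, on synthetic data (an assumption of the demo), so that the observed excess risk can be compared with the bound above.

```python
import numpy as np

def averaged_sgd_ls(X, y, gamma):
    """Constant-step-size LMS with iterate averaging; returns theta_bar_n."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        xi, yi = X[i], y[i]
        theta -= gamma * (xi @ theta - yi) * xi      # theta_i = theta_{i-1} - gamma (<x_i, theta> - y_i) x_i
        theta_bar += (theta - theta_bar) / (i + 1)   # running average of the iterates
    return theta_bar

rng = np.random.default_rng(0)
n, d, sigma = 20000, 20, 0.5
theta_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d)) / np.sqrt(d)         # so that ||Phi(x)||^2 is about R^2 = 1
y = X @ theta_star + sigma * rng.standard_normal(n)

R2 = np.mean(np.sum(X**2, axis=1))                   # estimate of R^2
theta_bar = averaged_sgd_ls(X, y, gamma=1.0 / (4 * R2))
Sigma = X.T @ X / n
excess = (theta_bar - theta_star) @ Sigma @ (theta_bar - theta_star)
bound = 4 * sigma**2 * d / n + 4 * R2 * np.sum(theta_star**2) / n   # theta_0 = 0
print("excess risk:", excess, "   bound:", bound)
```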

Markov chain interpretation of constant step sizes
- LMS recursion: $\theta_i = \theta_{i-1} - \gamma\,\big(\langle \Phi(x_i), \theta_{i-1}\rangle - y_i\big)\,\Phi(x_i)$.
- The sequence $(\theta_i)_i$ is a homogeneous Markov chain: it converges to a stationary distribution $\pi_\gamma$ with expectation $\bar\theta_\gamma := \int \theta\, \pi_\gamma(\mathrm{d}\theta)$.
- For least-squares, $\bar\theta_\gamma = \theta_*$.
- $\theta_n$ does not converge to $\theta_*$ but oscillates around it (see the experiment below).
- Ergodic theorem: the averaged iterates converge to $\bar\theta_\gamma = \theta_*$ at rate $O(1/n)$.
- See Dieuleveut, Durmus, and Bach (2017) for more details.
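
A small experiment, on an assumed synthetic model, contrasting the last iterate (which stays at a $\gamma$-dependent level around $\theta_*$) with the averaged iterate (which keeps improving).

```python
import numpy as np

# Compare the last SGD iterate (which oscillates in a ball around theta_*)
# with the averaged iterate (which converges), on a small synthetic problem.
rng = np.random.default_rng(1)
n, d, sigma, gamma = 50000, 5, 0.5, 0.1
theta_star = rng.standard_normal(d)
theta = np.zeros(d)
theta_bar = np.zeros(d)
last_err, avg_err = [], []
for i in range(1, n + 1):
    x = rng.standard_normal(d) / np.sqrt(d)
    y = x @ theta_star + sigma * rng.standard_normal()
    theta -= gamma * (x @ theta - y) * x
    theta_bar += (theta - theta_bar) / i
    if i % 10000 == 0:
        last_err.append(np.sum((theta - theta_star) ** 2))
        avg_err.append(np.sum((theta_bar - theta_star) ** 2))
print("last iterate errors :", np.round(last_err, 4))   # stays at a gamma-dependent level
print("averaged errors     :", np.round(avg_err, 5))    # keeps decreasing, roughly like 1/n
```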

Simulations - synthetic examples
- Gaussian distributions, $d = 20$.
- [Figure: $\log_{10}[F(\theta) - F(\theta_*)]$ versus $\log_{10}(n)$ for constant step sizes $1/(2R^2)$, $1/(8R^2)$, $1/(32R^2)$, compared with the $n^{-1}$ rate.]

Simulations - benchmarks
- alpha dataset ($d = 500$) and news dataset.
- [Figure: $\log_{10}[F(\theta) - F(\theta_*)]$ versus $\log_{10}(n)$ on both datasets, comparing constant-step-size averaged SGD (step size $1/R^2$), the $n^{-1/2}$ rate, and SAG.]

Optimal bounds for least-squares?
- Least-squares: cannot beat $\sigma^2 d / n$ (Tsybakov, 2003). Really?
- What if $d \gg n$? Needs assumptions on $\Sigma = \mathbb{E}\big[\Phi(x) \otimes \Phi(x)\big]$ and $\theta_*$.

Finer assumptions (Dieuleveut and Bach, 2016)
- Covariance eigenvalues. Pessimistic assumption: all eigenvalues $\lambda_m$ bounded by a constant. Actual decay: $\lambda_m = o(m^{-\alpha})$ with $\mathrm{tr}\,\Sigma^{1/\alpha} = \sum_m \lambda_m^{1/\alpha}$ small.
- New result: replace $\dfrac{\sigma^2 d}{n}$ by $\dfrac{\sigma^2\, (\gamma n)^{1/\alpha}\, \mathrm{tr}\,\Sigma^{1/\alpha}}{n}$.
- [Figure: eigenvalue decay $\log \lambda_m$ versus $\log m$, and the quantity $(\gamma n)^{1/\alpha}\, \mathrm{tr}\,\Sigma^{1/\alpha} / n$ as a function of $\alpha$ on the alpha dataset.]
- Optimal predictor. Pessimistic assumption: $\|\theta_0 - \theta_*\|^2$ finite/small. Finer assumption: $\|\Sigma^{1/2 - r}(\theta_0 - \theta_*)\|^2$ small, for $r \in [0, 1]$. Always satisfied for $r = 0$ and $\theta_0 = 0$, since $\|\Sigma^{1/2}\theta_*\|^2 \leq \mathbb{E}\, y^2$.
- New result: replace $\dfrac{\|\theta_0 - \theta_*\|^2}{\gamma n}$ by $\dfrac{\|\Sigma^{1/2 - r}(\theta_0 - \theta_*)\|^2}{\gamma^{2r} n^{2r}}$.

Optimal bounds for least-squares?
- Least-squares: cannot beat $\sigma^2 d / n$ (Tsybakov, 2003). Really? What if $d \gg n$?
- Refined assumptions with adaptivity (Dieuleveut and Bach, 2016), beyond strong convexity or lack thereof:
  $\mathbb{E} F(\bar\theta_n) - F(\theta_*) \leq \inf_{\alpha \geq 1,\ r \in [0,1]} \ \dfrac{4\sigma^2\, \mathrm{tr}\,\Sigma^{1/\alpha}\, (\gamma n)^{1/\alpha}}{n} + \dfrac{4\, \|\Sigma^{1/2 - r}\theta_*\|^2}{\gamma^{2r} n^{2r}}$
  (the two terms are evaluated numerically in the sketch below).
- Previous results correspond to $\alpha = +\infty$ and $r = 1/2$.
- The optimal step size $\gamma$ potentially decays with $n$, but depends on the usually unknown quantities $\alpha$ and $r$: no adaptivity (yet).
- Extension to non-parametric estimation (using kernels) with optimal rates when $r \geq (\alpha - 1)/(2\alpha)$, still with $O(n^2)$ running time.
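
To see the bias/variance trade-off in $\gamma$, one can evaluate the two terms of this bound on a diagonal toy spectrum; all numerical choices below (spectrum, coefficients, constants, truncation) are illustrative assumptions.

```python
import numpy as np

# Evaluate the two terms of the bound
#   4 sigma^2 tr(Sigma^{1/alpha}) (gamma n)^{1/alpha} / n
#     + 4 ||Sigma^{1/2-r} theta_*||^2 / (gamma^{2r} n^{2r})
# on a diagonal toy spectrum, as a function of gamma.
alpha, r, sigma, n = 2.0, 0.25, 0.5, 10**5
m = np.arange(1, 10**5 + 1)
lam = m ** (-2.2)            # decays a bit faster than m^{-alpha}, so tr(Sigma^{1/alpha}) is finite
theta_star = m ** (-1.0)     # illustrative coefficients

tr_cap = np.sum(lam ** (1.0 / alpha))                  # tr(Sigma^{1/alpha}), truncated
src = np.sum(lam ** (1 - 2 * r) * theta_star ** 2)     # ||Sigma^{1/2-r} theta_*||^2, truncated

for gamma in [1e-3, 1e-2, 1e-1, 1.0]:
    variance = 4 * sigma**2 * tr_cap * (gamma * n) ** (1.0 / alpha) / n
    bias = 4 * src / (gamma ** (2 * r) * n ** (2 * r))
    print(f"gamma={gamma:7.3f}  variance={variance:.3e}  bias={bias:.3e}  total={variance + bias:.3e}")
```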

From least-squares to non-parametric estimation
- Extension to Hilbert spaces: $\langle \Phi(x), \theta\rangle$ with $\theta \in \mathcal{H}$, and $\theta_i = \theta_{i-1} - \gamma\,\big(\langle \Phi(x_i), \theta_{i-1}\rangle - y_i\big)\,\Phi(x_i)$.
- If $\theta_0 = 0$, then $\theta_i$ is a linear combination of $\Phi(x_1), \dots, \Phi(x_i)$: $\theta_i = \sum_{k=1}^{i} a_k \Phi(x_k)$ with $a_i = -\gamma \sum_{k=1}^{i-1} a_k \langle \Phi(x_k), \Phi(x_i)\rangle + \gamma\, y_i$.
- Kernel trick: $k(x, x') = \langle \Phi(x), \Phi(x')\rangle$.
- Reproducing kernel Hilbert spaces and non-parametric estimation; see, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004); Dieuleveut and Bach (2016).
- Still $O(n^2)$ overall running time (see the sketch below).
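
A sketch of this single-pass kernel recursion written directly on the coefficients $a_i$; the Gaussian kernel, the step size and the synthetic data are illustrative choices, and any positive-definite kernel could be plugged in.

```python
import numpy as np

def kernel_sgd_single_pass(X, y, kernel, gamma):
    """Single-pass kernel LMS: theta_i = sum_{k<=i} a_k Phi(x_k), with
    a_i = gamma * (y_i - sum_{k<i} a_k k(x_k, x_i))."""
    n = X.shape[0]
    a = np.zeros(n)
    for i in range(n):
        pred = sum(a[k] * kernel(X[k], X[i]) for k in range(i))   # <theta_{i-1}, Phi(x_i)>
        a[i] = gamma * (y[i] - pred)
    return a    # prediction at x: sum_k a[k] * kernel(X[k], x)

# Illustrative use with a Gaussian kernel.
rng = np.random.default_rng(0)
n = 300
X = rng.uniform(0, 1, size=(n, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)
gauss = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 0.1)
a = kernel_sgd_single_pass(X, y, gauss, gamma=0.25)
x_test = np.array([0.3])
print("prediction at 0.3:", sum(a[k] * gauss(X[k], x_test) for k in range(n)))
```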

Example: Sobolev spaces in one dimension
- $\mathcal{X} = [0, 1]$, functions represented through their Fourier series.
- Weighted Fourier basis $\Phi(x)_m = \lambda_m^{1/2} \cos(2\pi m x)$ (plus sines); kernel $k(x, x') = \sum_m \lambda_m \cos\big[2\pi m (x - x')\big]$ (see the sketch below).
- $\lambda_m \asymp m^{-\alpha}$ corresponds to a Sobolev penalty on $f_\theta(x) = \langle \theta, \Phi(x)\rangle$:
  $\|f_\theta\|^2 = \|\theta\|^2 = \sum_m |\mathrm{Fourier}(f_\theta)_m|^2\, \lambda_m^{-1} \asymp \int_0^1 |f_\theta^{(\alpha/2)}(x)|^2\, \mathrm{d}x$.
- The adapted norm $\|\Sigma^{1/2 - r}\theta\|^2$ depends on the regularity of $f_\theta$:
  $\|\Sigma^{1/2 - r}\theta\|^2 = \sum_m |\mathrm{Fourier}(f_\theta)_m|^2\, \lambda_m^{-2r} \asymp \int_0^1 |f_\theta^{(r\alpha)}(x)|^2\, \mathrm{d}x$.
- Optimal rate: $O(n^{-2r\alpha/(2r\alpha+1)})$.
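
A truncated-Fourier-series sketch of this kernel; the truncation level M is an assumption of the demo.

```python
import numpy as np

def sobolev_kernel(x, xp, alpha, M=2000):
    """Truncated version of k(x, x') = sum_m lam_m cos(2 pi m (x - x'))
    with lam_m = m^{-alpha} (Sobolev-type kernel on [0, 1])."""
    m = np.arange(1, M + 1)
    return np.sum(m ** (-alpha) * np.cos(2 * np.pi * m * (x - xp)))

# The associated feature map stacks lam_m^{1/2} cos(2 pi m x) and
# lam_m^{1/2} sin(2 pi m x); the kernel above is exactly its inner product.
print(sobolev_kernel(0.2, 0.5, alpha=2.0))
```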

New assumption needed
- Assumption: $\|\Sigma^{\mu/2 - 1/2}\,\Phi(x)\|$ is almost surely bounded (small); already used by Steinwart et al. (2009).
- True for $\mu = 1$; usually $\mu \geq 1/\alpha$ (with equality for Sobolev spaces).
- Relationship between the $L_\infty$ norm and the RKHS norm: $\|g\|_{L_\infty} = O\big(\|g\|_{\mathcal{H}}^{\mu}\, \|g\|_{L_2}^{1 - \mu}\big)$.
- NB: implies bounded leverage scores (Rudi et al., 2015).

Multiple-pass SGD (sampling with replacement)
- Algorithm from $n$ i.i.d. observations $(x_i, y_i)$, $i = 1, \dots, n$:
  $\theta_u = \theta_{u-1} + \gamma\,\big(y_{i(u)} - \langle \theta_{u-1}, \Phi(x_{i(u)})\rangle\big)\,\Phi(x_{i(u)})$,
  with $\bar\theta_t$ the averaged iterate after $t \geq n$ iterations (see the sketch below).
- Theorem (Pillaud-Vivien, Rudi, and Bach, 2018). Assume $r \leq \frac{\alpha - 1}{2\alpha}$ (hard problems).
  - If $\mu \leq 2r$, then after $t = \Theta(n^{\alpha/(2r\alpha+1)})$ iterations, $\mathbb{E} F(\bar\theta_t) - F(\theta_*) = O(n^{-2r\alpha/(2r\alpha+1)})$: optimal.
  - Otherwise, after $t = \Theta(n^{1/\mu})$ iterations (up to logarithmic factors), $\mathbb{E} F(\bar\theta_t) - F(\theta_*) \leq O(n^{-2r/\mu})$: improved.
- Proof technique following Rosasco and Villa (2015).
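
A sketch of multiple-pass SGD with replacement in the kernel parametrization, with $t$ following the theorem's $n^{\alpha/(2r\alpha+1)}$ scaling. Here the kernel, the target, and the values of $\alpha$ and $r$ are assumptions of the demo (in practice $r$ is a property of the unknown target, not a tunable parameter).

```python
import numpy as np

def multipass_kernel_sgd(K, y, gamma, t, rng):
    """Multiple-pass SGD with replacement, written on kernel coefficients:
    theta = sum_k a_k Phi(x_k); each step picks i(u) uniformly in {1,...,n}
    and does a[i] += gamma * (y_i - (K a)_i).  Returns the averaged coefficients."""
    n = K.shape[0]
    a = np.zeros(n)
    a_bar = np.zeros(n)
    for u in range(1, t + 1):
        i = rng.integers(n)                 # sampling with replacement
        a[i] += gamma * (y[i] - K[i] @ a)   # y_{i(u)} - <theta_{u-1}, Phi(x_{i(u)})>
        a_bar += (a - a_bar) / u
    return a_bar

# Illustrative run with the Sobolev-type kernel from the earlier sketch and a
# non-smooth ("hard") target; t grows faster than n, i.e. several passes.
rng = np.random.default_rng(0)
n, alpha, r = 500, 4.0, 0.25
X = rng.uniform(0, 1, size=n)
D = X[:, None] - X[None, :]
K = np.zeros((n, n))
for m in range(1, 1000):
    K += m ** (-alpha) * np.cos(2 * np.pi * m * D)
y = np.sign(np.sin(2 * np.pi * X)) + 0.1 * rng.standard_normal(n)
gamma = 1.0 / (4 * np.max(np.diag(K)))
t = int(n ** (alpha / (2 * r * alpha + 1)))
a_bar = multipass_kernel_sgd(K, y, gamma, t, rng)
print("iterations t =", t, "  passes over the data =", t / n)
```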

Proof sketch
- Algorithm from $n$ i.i.d. observations: $\theta_u = \theta_{u-1} + \gamma\,\big(y_{i(u)} - \langle \theta_{u-1}, \Phi(x_{i(u)})\rangle\big)\,\Phi(x_{i(u)})$, with $\bar\theta_t$ the averaged iterate after $t \geq n$ iterations.
- Following Rosasco and Villa (2015), consider the batch gradient recursion
  $\eta_u = \eta_{u-1} + \dfrac{\gamma}{n} \sum_{i=1}^{n} \big(y_i - \langle \eta_{u-1}, \Phi(x_i)\rangle\big)\,\Phi(x_i)$,
  with $\bar\eta_t$ the averaged iterate after $t \geq n$ iterations (a coefficient-space version is sketched below).
- As long as $t = O(n^{1/\mu})$:
  - Property 1: $\mathbb{E} F(\bar\theta_t) - \mathbb{E} F(\bar\eta_t) = O\big(t^{1/\alpha} / n\big)$.
  - Property 2: $\mathbb{E} F(\bar\eta_t) - F(\theta_*) = O\big(t^{1/\alpha} / n\big) + O\big(t^{-2r}\big)$.
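
The batch recursion $\eta_u$ used for comparison in the proof, written in the same coefficient parametrization as the multiple-pass sketch above (a minimal sketch, not the paper's code).

```python
import numpy as np

def batch_gradient_kernel(K, y, gamma, t):
    """Batch gradient recursion from the proof sketch, in kernel coefficients:
    eta_u = eta_{u-1} + (gamma/n) * sum_i (y_i - <eta_{u-1}, Phi(x_i)>) Phi(x_i),
    i.e. a <- a + (gamma/n) * (y - K a).  Returns the averaged iterate's coefficients."""
    n = K.shape[0]
    a = np.zeros(n)
    a_bar = np.zeros(n)
    for u in range(1, t + 1):
        a += (gamma / n) * (y - K @ a)
        a_bar += (a - a_bar) / u
    return a_bar
```

Comparing its averaged iterate with the one returned by `multipass_kernel_sgd` on the same Gram matrix is exactly the coupling used in Properties 1 and 2.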


Statistical optimality
- If $\mu \leq 2r$: after $t = \Theta(n^{\alpha/(2r\alpha+1)})$ iterations, $\mathbb{E} F(\bar\theta_t) - F(\theta_*) = O(n^{-2r\alpha/(2r\alpha+1)})$ (optimal).
- Otherwise: after $t = \Theta(n^{1/\mu})$ iterations (up to logarithmic factors), $\mathbb{E} F(\bar\theta_t) - F(\theta_*) \leq O(n^{-2r/\mu})$ (improved).

Simulations
- Synthetic examples: one-dimensional kernel regression in Sobolev spaces, with arbitrarily chosen values of $r$ and $\alpha$.
- Check the optimal number of iterations over the data.
- Comparing three sampling schemes: with replacement, without replacement (cycling with random reshuffling), and cycling.

Simulations (sampling with replacement, sampling without replacement, and cycling)
- [Figures: for each of the three sampling schemes, the observed optimal number of iterations $\log_{10} t_*$ versus $\log_{10} n$, in four settings ($\alpha = 3/2$, $r = 1/3$), ($\alpha = 4$, $r = 1/4$), ($\alpha = 5/2$, $r = 1/5$), ($\alpha = 3$, $r = 1/6$), compared with a reference line of the predicted slope (e.g., slope 1 in the first setting and 1.25 in the third).]

Simulations - benchmarks
- [Figure: MNIST dataset with a linear kernel; the optimal number of passes as a function of the number of samples.]

Benefits of multiple passes - conclusion
- The number of passes grows with the sample size for hard problems.
- First provable improvement of multiple-pass SGD over single-pass SGD [NB: Hardt et al. (2016); Lin and Rosasco (2017) consider small step sizes].
- Current work and extensions:
  - study of cycling and sampling without replacement (Shamir, 2016; Gürbüzbalaban et al., 2015);
  - mini-batches;
  - beyond least-squares;
  - efficient optimal algorithms for the situation $\mu > 2r$;
  - combining the analysis with exponential convergence of testing errors (Pillaud-Vivien, Rudi, and Bach, 2017).

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.
A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363-1399, 2016.
A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Technical report, arXiv, 2017.
M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. Technical report, arXiv, 2015.
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.
J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1-47, 2017.
O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, West Sussex, 1995.
L. Pillaud-Vivien, A. Rudi, and F. Bach. Stochastic gradient methods with exponential convergence of testing errors. Technical report, arXiv, 2017.
L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Technical report, arXiv, 2018.
L. Rosasco and S. Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, 2015.
A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, 2015.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 46-54, 2016.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proc. COLT, 2009.
A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.
Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.
