Stochastic gradient descent and robustness to ill-conditioning

1 Stochastic gradient descent and robustness to ill-conditioning. Francis Bach, INRIA - Ecole Normale Supérieure, Paris, France. Joint work with Aymeric Dieuleveut, Nicolas Flammarion, Eric Moulines - Venezia, May 2016

2 Big data revolution? A new scientific context. Data everywhere: size does not (always) matter. Science and industry: size and variety. Learning from examples: n observations in dimension p

3 Search engines - Advertising

4 Visual object recognition

5 Bioinformatics. Proteins: crucial elements of cell life. Massive data: 2 million for humans. Complex data

6 Context: machine learning for big data. Large-scale machine learning: large p, large n, with p the dimension of each observation (input) and n the number of observations. Examples: computer vision, bioinformatics, advertising. Ideal running-time complexity: O(pn). Going back to simple methods: stochastic gradient methods (Robbins and Monro, 1951). Mixing statistics and optimization. Using smoothness to go beyond stochastic gradient descent

9 Supervised machine learning. Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d. Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^p$. Explicit features adapted to inputs (can be learned as well). Using Hilbert spaces for non-linear / non-parametric estimation

10 Supervised machine learning. Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d. Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^p$. (Regularized) empirical risk minimization: find $\hat\theta$ solution of
$$\min_{\theta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \Omega(\theta)$$
(convex data-fitting term + regularizer)

12 Usual losses. Regression: $y \in \mathbb{R}$, prediction $\hat y = \langle \theta, \Phi(x) \rangle$, quadratic loss $\frac{1}{2}(y - \hat y)^2 = \frac{1}{2}\big(y - \langle \theta, \Phi(x) \rangle\big)^2$. Classification: $y \in \{-1, 1\}$, prediction $\hat y = \mathrm{sign}\big(\langle \theta, \Phi(x) \rangle\big)$, loss of the form $\ell\big(y \langle \theta, \Phi(x) \rangle\big)$. True 0-1 loss: $\ell\big(y \langle \theta, \Phi(x) \rangle\big) = 1_{y \langle \theta, \Phi(x) \rangle < 0}$. Usual convex losses: hinge, square, logistic
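
As a quick illustration (ours, not part of the original slides), here is a minimal NumPy sketch of these losses viewed as functions of the margin $u = y\,\langle \theta, \Phi(x) \rangle$; the function names are illustrative.

```python
import numpy as np

def zero_one_loss(u):
    # True 0-1 loss: 1 when the margin u = y * <theta, Phi(x)> is negative
    return (u < 0).astype(float)

def hinge_loss(u):
    return np.maximum(0.0, 1.0 - u)

def square_loss(u):
    # For y in {-1, 1}, (y - yhat)^2 = (1 - u)^2 with u = y * yhat
    return (1.0 - u) ** 2

def logistic_loss(u):
    return np.log1p(np.exp(-u))

margins = np.linspace(-2.0, 2.0, 5)
for name, fn in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                 ("square", square_loss), ("logistic", logistic_loss)]:
    print(name, fn(margins))
```

The hinge and square losses upper-bound the 0-1 loss, and the logistic loss does so up to a constant factor, which is what makes them usable convex surrogates.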

14 Supervised machine learning. Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d. Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^p$. (Regularized) empirical risk minimization: find $\hat\theta$ solution of
$$\min_{\theta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \Omega(\theta)$$
(convex data-fitting term + regularizer). Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$, the training cost. Expected risk: $f(\theta) = \mathbb{E}_{(x,y)}\, \ell(y, \langle \theta, \Phi(x) \rangle)$, the testing cost. Two fundamental questions: (1) computing $\hat\theta$ and (2) analyzing $\hat\theta$
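
To make the two costs concrete, here is a small sketch (ours, on illustrative synthetic data) that computes the ridge-regularized least-squares estimator in closed form and compares the empirical risk on the training set with a held-out estimate of the expected risk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 1000, 20, 1e-2

# Synthetic linear model y = <theta_*, Phi(x)> + noise
theta_star = rng.normal(size=p)
Phi = rng.normal(size=(2 * n, p))
y = Phi @ theta_star + 0.5 * rng.normal(size=2 * n)
Phi_train, y_train = Phi[:n], y[:n]
Phi_test, y_test = Phi[n:], y[n:]

# Regularized ERM with the quadratic loss has a closed-form solution
theta_hat = np.linalg.solve(Phi_train.T @ Phi_train / n + mu * np.eye(p),
                            Phi_train.T @ y_train / n)

empirical_risk = 0.5 * np.mean((y_train - Phi_train @ theta_hat) ** 2)  # training cost
expected_risk = 0.5 * np.mean((y_test - Phi_test @ theta_hat) ** 2)     # testing cost (estimate)
print(empirical_risk, expected_risk)
```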

16 Smoothness and strong convexity. A function $g : \mathbb{R}^p \to \mathbb{R}$ is L-smooth if and only if it is twice differentiable and, for all $\theta \in \mathbb{R}^p$, eigenvalues$[g''(\theta)] \le L$. Machine learning with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$: Hessian $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \otimes \Phi(x_i)$. Bounded data: $\|\Phi(x)\| \le R$ implies $L = O(R^2)$

20 Smoothness and strong convexity. A twice differentiable function $g : \mathbb{R}^p \to \mathbb{R}$ is $\mu$-strongly convex if and only if, for all $\theta \in \mathbb{R}^p$, eigenvalues$[g''(\theta)] \ge \mu$. Machine learning with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$: Hessian $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \otimes \Phi(x_i)$. Data with invertible covariance matrix (low correlation/dimension). Adding regularization by $\frac{\mu}{2}\|\theta\|^2$ creates additional bias unless $\mu$ is small
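
For the square loss the Hessian of the empirical risk is exactly the empirical covariance of the features, so L is its largest eigenvalue (bounded by $R^2$ for bounded data) and $\mu$ its smallest. A small sketch (ours, on illustrative anisotropic synthetic features):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 20
Phi = rng.normal(size=(n, p)) @ np.diag(np.linspace(0.1, 1.0, p))  # ill-conditioned features

# For the square loss, the Hessian of the empirical risk is the feature covariance
H = Phi.T @ Phi / n
eigs = np.linalg.eigvalsh(H)            # ascending order
L, mu = eigs[-1], eigs[0]

R2 = np.max(np.sum(Phi ** 2, axis=1))   # max squared norm of a feature vector
print("L =", L, "<= R^2 =", R2)         # smoothness constant
print("mu =", mu, "condition number L/mu =", L / mu)
```

With strongly anisotropic features the condition number $L/\mu$ is large, which is exactly the ill-conditioned regime the talk targets.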

23 Iterative methods for minimizing smooth functions. Assumption: g convex and smooth on $\mathbb{R}^p$. Gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1})$, with $O(1/t)$ convergence rate for convex functions and $O(e^{-(\mu/L)t})$ convergence rate for strongly convex functions. Newton method: $\theta_t = \theta_{t-1} - g''(\theta_{t-1})^{-1} g'(\theta_{t-1})$, with $O(e^{-\rho 2^t})$ convergence rate. Key insights from Bottou and Bousquet (2008): (1) in machine learning, no need to optimize below statistical error; (2) in machine learning, cost functions are averages. Hence: stochastic approximation
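
A hedged sketch (ours) of the two updates on a strongly convex quadratic, just to make the recursions concrete; the problem and step size are illustrative, and on a quadratic a single Newton step is already exact.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10
A = rng.normal(size=(p, p))
H = A @ A.T + 0.1 * np.eye(p)           # positive definite Hessian
b = rng.normal(size=p)

def g_grad(theta):                      # gradient of g(theta) = 1/2 theta'H theta - b'theta
    return H @ theta - b

L = np.linalg.eigvalsh(H)[-1]
theta_gd = np.zeros(p)
for t in range(200):                    # gradient descent with step size 1/L (slow if ill-conditioned)
    theta_gd = theta_gd - (1.0 / L) * g_grad(theta_gd)

theta_newton = np.zeros(p)
theta_newton = theta_newton - np.linalg.solve(H, g_grad(theta_newton))  # one Newton step (exact here)

theta_star = np.linalg.solve(H, b)
print(np.linalg.norm(theta_gd - theta_star), np.linalg.norm(theta_newton - theta_star))
```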

25 Stochastic approximation. Goal: minimizing a function f defined on $\mathbb{R}^p$, given only unbiased estimates $f_n'(\theta_n)$ of its gradients $f'(\theta_n)$ at certain points $\theta_n \in \mathbb{R}^p$. Machine learning - statistics: $f(\theta) = \mathbb{E} f_n(\theta) = \mathbb{E}\, \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$ = generalization error. Loss for a single pair of observations: $f_n(\theta) = \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$. Expected gradient: $f'(\theta) = \mathbb{E} f_n'(\theta) = \mathbb{E}\{\ell'(y_n, \langle \theta, \Phi(x_n) \rangle)\, \Phi(x_n)\}$. Beyond convex optimization: see, e.g., Benveniste et al. (2012)

27 Convex stochastic approximation. Key assumption: smoothness and/or strong convexity. Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro): $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$. Polyak-Ruppert averaging: $\bar\theta_n = \frac{1}{n+1} \sum_{k=0}^{n} \theta_k$. Which learning rate sequence $\gamma_n$? Classical setting: $\gamma_n = C n^{-\alpha}$. Running time = O(np). Single pass through the data. One line of code among many
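
The recursion fits in a few lines; below is a minimal sketch (ours) of SGD with Polyak-Ruppert averaging on a streaming least-squares problem, with the classical step size $\gamma_n = C n^{-\alpha}$. The constants C and alpha, and the synthetic data, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10000, 20
theta_star = rng.normal(size=p)

C, alpha = 0.02, 0.5                     # illustrative choice of gamma_n = C * n^(-alpha)
theta = np.zeros(p)
theta_bar = np.zeros(p)                  # Polyak-Ruppert average of the iterates

for k in range(1, n + 1):                # single pass over the data stream
    phi = rng.normal(size=p)             # features Phi(x_k) of one observation
    y = phi @ theta_star + 0.1 * rng.normal()
    grad = (phi @ theta - y) * phi       # stochastic gradient of the square loss
    theta = theta - C * k ** (-alpha) * grad
    theta_bar += (theta - theta_bar) / (k + 1)  # running average

print(np.linalg.norm(theta_bar - theta_star))
```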

30 Convex stochastic approximation: existing analysis. Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012). Strongly convex: $O((\mu n)^{-1})$, attained by averaged stochastic gradient descent with $\gamma_n \propto (\mu n)^{-1}$. Non-strongly convex: $O(n^{-1/2})$, attained by averaged stochastic gradient descent with $\gamma_n \propto n^{-1/2}$. Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$ for smooth strongly convex problems. A single algorithm for smooth problems with global convergence rate O(1/n) in all situations?

32 Least-mean-square algorithm. Least-squares: $f(\theta) = \frac{1}{2} \mathbb{E}\big[(y_n - \langle \Phi(x_n), \theta \rangle)^2\big]$ with $\theta \in \mathbb{R}^p$. SGD = least-mean-square algorithm (see, e.g., Macchi, 1995), usually studied without averaging, with decreasing step sizes, and with the strong convexity assumption $\mathbb{E}[\Phi(x_n) \otimes \Phi(x_n)] = H \succeq \mu\, \mathrm{Id}$. New analysis for averaging and constant step size $\gamma = 1/(4R^2)$. Assume $\|\Phi(x_n)\| \le R$ and $|y_n - \langle \Phi(x_n), \theta_* \rangle| \le \sigma$ almost surely; no assumption regarding the lowest eigenvalues of H. Main result: $\mathbb{E} f(\bar\theta_n) - f(\theta_*) \le \frac{4\sigma^2 p}{n} + \frac{4R^2 \|\theta_0 - \theta_*\|^2}{n}$. Matches the statistical lower bound (Tsybakov, 2003). Non-asymptotic robust version of Györfi and Walk (1996)
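
Here is a minimal sketch (ours, not the authors' reference code) of this constant-step-size averaged LMS recursion, with $\gamma = 1/(4R^2)$ computed from a crude bound on the feature norms; the bounded synthetic data and its conditioning are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 20000, 20, 0.5
theta_star = rng.normal(size=p)
scales = np.linspace(0.1, 1.0, p)        # anisotropic, hence ill-conditioned, features

R2 = p * scales.max() ** 2               # crude upper bound on ||Phi(x)||^2
gamma = 1.0 / (4.0 * R2)                 # constant step size gamma = 1/(4 R^2)

theta = np.zeros(p)
theta_bar = np.zeros(p)
for k in range(1, n + 1):
    phi = scales * rng.uniform(-1.0, 1.0, size=p)           # bounded features
    y = phi @ theta_star + sigma * rng.uniform(-1.0, 1.0)   # bounded noise
    theta = theta - gamma * (phi @ theta - y) * phi         # LMS recursion
    theta_bar += (theta - theta_bar) / (k + 1)              # averaged iterate

print("last iterate error:    ", np.linalg.norm(theta - theta_star))
print("averaged iterate error:", np.linalg.norm(theta_bar - theta_star))
```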

36 Markov chain interpretation of constant step sizes. LMS recursion for $f_n(\theta) = \frac{1}{2}\big(y_n - \langle \Phi(x_n), \theta \rangle\big)^2$: $\theta_n = \theta_{n-1} - \gamma\big(\langle \Phi(x_n), \theta_{n-1} \rangle - y_n\big)\Phi(x_n)$. The sequence $(\theta_n)_n$ is a homogeneous Markov chain: convergence to a stationary distribution $\pi_\gamma$ with expectation $\bar\theta_\gamma \overset{\mathrm{def}}{=} \int \theta\, \pi_\gamma(\mathrm{d}\theta)$. For least-squares, $\bar\theta_\gamma = \theta_*$. $\theta_n$ does not converge to $\theta_*$ but oscillates around it, with oscillations of order $\sqrt{\gamma}$. Ergodic theorem: averaged iterates converge to $\bar\theta_\gamma = \theta_*$ at rate O(1/n)
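
A small simulation sketch (ours) of this picture: the final iterate keeps oscillating around $\theta_*$ at a level set by the step size, while the averaged iterate keeps improving; problem sizes and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
theta_star = np.ones(p)

def run(gamma, n):
    theta, theta_bar = np.zeros(p), np.zeros(p)
    last_errs = []
    for k in range(1, n + 1):
        phi = rng.normal(size=p)
        y = phi @ theta_star + 0.3 * rng.normal()
        theta = theta - gamma * (phi @ theta - y) * phi   # constant-step LMS
        theta_bar += (theta - theta_bar) / (k + 1)        # running average
        if k > n // 2:                                    # measure the stationary oscillation
            last_errs.append(np.linalg.norm(theta - theta_star))
    return np.mean(last_errs), np.linalg.norm(theta_bar - theta_star)

for gamma in [0.1, 0.01]:
    osc, avg_err = run(gamma, 20000)
    print(f"gamma={gamma}: last-iterate oscillation ~ {osc:.3f}, averaged error {avg_err:.4f}")
```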

37 Simulations - synthetic examples. Gaussian distributions, p = 20. [Plot: $\log_{10}[f(\bar\theta) - f(\theta_*)]$ versus $\log_{10}(n)$ for the square loss, comparing constant step sizes $1/(2R^2)$, $1/(8R^2)$, $1/(32R^2)$ and a decaying step size $1/(2R^2\sqrt{n})$.]

38 Simulations - benchmarks: alpha (p = 500) and news datasets. [Plots: test excess risk $\log_{10}[f(\theta) - f(\theta_*)]$ versus $\log_{10}(n)$ for the square loss, with step sizes $1/R^2$ and $1/(R^2\sqrt{n})$ for C = 1 and C = opt, compared with SAG.]

40 Isn't least-squares regression a regression? Least-squares regression: simpler to analyze and understand, with an explicit relationship to bias/variance trade-offs (next slides). But many important loss functions are not quadratic. Beyond least-squares with online Newton steps: complexity of O(p) per iteration with rate O(p/n). See Bach and Moulines (2013) for details

43 Optimal bounds for least-squares? Least-squares: cannot beat $\sigma^2 p / n$ (Tsybakov, 2003). Really? Refined analysis (Défossez and Bach, 2015):
$$\mathbb{E} f(\bar\theta_n) - f(\theta_*) \le \frac{\sigma^2 p}{n} + \frac{\|H^{-1/2}(\theta_0 - \theta_*)\|_2^2}{\gamma^2 n^2}$$
In practice the bias may be larger than the variance, so $\sigma^2 p / n$ is pessimistic. Refined assumptions with adaptivity (Dieuleveut and Bach, 2014):
$$\mathbb{E} f(\bar\theta_n) - f(\theta_*) \le \frac{\sigma^2 \gamma^{1/\alpha}\, \mathrm{tr}\, H^{1/\alpha}}{n^{1-1/\alpha}} + \frac{\|H^{1/2-r}(\theta_0 - \theta_*)\|_2^2}{\gamma^{2r}\, n^{2\min\{r,1\}}}$$
SGD is adaptive to the covariance matrix eigenvalue decay. Leads to optimal rates for non-parametric regression

46 Achieving optimal bias and variance terms. Current results with averaged SGD (ill-conditioned problems): variance (starting from the optimal $\theta_*$) $= \frac{\sigma^2 p}{n}$; bias (no noise) $= \min\Big\{ \frac{R^2 \|\theta_0 - \theta_*\|^2}{n},\ \frac{R^4 \langle \theta_0 - \theta_*, H^{-1}(\theta_0 - \theta_*) \rangle}{n^2} \Big\}$. Averaged gradient descent (Bach and Moulines, 2013): bias $\frac{R^2 \|\theta_0 - \theta_*\|^2}{n}$, variance $\frac{\sigma^2 p}{n}$

48 Achieving optimal bias and variance terms. Averaged gradient descent (Bach and Moulines, 2013): bias $\frac{R^2 \|\theta_0 - \theta_*\|^2}{n}$, variance $\frac{\sigma^2 p}{n}$. Accelerated gradient descent (Nesterov, 1983): bias $\frac{R^2 \|\theta_0 - \theta_*\|^2}{n^2}$, variance $\sigma^2 p$. Acceleration is notoriously non-robust to noise (d'Aspremont, 2008; Schmidt et al., 2011). For non-structured noise, see Lan (2012)

50 Achieving optimal bias and variance terms.
Averaged gradient descent (Bach and Moulines, 2013): bias $\frac{R^2 \|\theta_0 - \theta_*\|^2}{n}$, variance $\frac{\sigma^2 p}{n}$.
Accelerated gradient descent (Nesterov, 1983): bias $\frac{R^2 \|\theta_0 - \theta_*\|^2}{n^2}$, variance $\sigma^2 p$.
Between averaging and acceleration (Flammarion and Bach, 2015): bias $\frac{R^2 \|\theta_0 - \theta_*\|^2}{n^{1+\alpha}}$, variance $\frac{\sigma^2 p}{n^{1-\alpha}}$.
Averaging and acceleration (Dieuleveut, Flammarion, and Bach, 2016): bias $\frac{R^2 \|\theta_0 - \theta_*\|^2}{n^2}$, variance $\frac{\sigma^2 p}{n}$.

52 Conclusions. Constant-step-size averaged stochastic gradient descent: reaches the convergence rate O(1/n) in all regimes; improves on the $O(1/\sqrt{n})$ lower bound for non-smooth problems; efficient online Newton step for non-quadratic problems; robustness to step-size selection and adaptivity. Extensions and future work: going beyond a single pass (Le Roux, Schmidt, and Bach, 2012; Defazio, Bach, and Lacoste-Julien, 2014); proximal extensions for non-differentiable terms; kernels and nonparametric estimation (Dieuleveut and Bach, 2014); parallelization; non-convex problems

53 References
A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5), 2012.
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical report, HAL, 2013.
A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 2012.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
A. d'Aspremont. Smooth optimization with approximate gradient. SIAM J. Optim., 19(3), 2008.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Adv. NIPS, 2014.
A. Défossez and F. Bach. Constant step size least-mean-square: Bias-variance trade-offs and optimal sampling distributions, 2015.
A. Dieuleveut and F. Bach. Non-parametric stochastic approximation with large step sizes. Technical report, arXiv, 2014.
A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Technical report, arXiv, 2016.
N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. arXiv preprint, 2015.
L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31-61, 1996.
G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1-2, Ser. A), 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012.
O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, West Sussex, 1995.
A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3), 1983.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 1992.
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22, 1951.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.
M. Schmidt, N. Le Roux, and F. Bach. Convergence rates for inexact proximal-gradient methods. In Adv. NIPS, 2011.
A. B. Tsybakov. Optimal rates of aggregation, 2003.
