Optimization for Machine Learning


Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS Tübingen Machine Learning Summer School, June 2017

Course materials http://suvrit.de/teaching.html Some references: Introductory lectures on convex optimization (Nesterov); Convex optimization (Boyd & Vandenberghe); Nonlinear programming (Bertsekas); Convex Analysis (Rockafellar); Fundamentals of convex analysis (Hiriart-Urruty, Lemaréchal); Lectures on modern convex optimization (Nemirovski); Optimization for Machine Learning (Sra, Nowozin, Wright); NIPS 2016 Optimization Tutorial (Bach, Sra) Some related courses: EE227A, Spring 2013 (Sra, UC Berkeley); 10-801, Spring 2014 (Sra, CMU); EE364a,b (Boyd, Stanford); EE236b,c (Vandenberghe, UCLA) Venues: NIPS, ICML, UAI, AISTATS, SIOPT, Math. Prog. Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 2 / 22

Lecture Plan Introduction Recap of convexity, sets, functions Recap of duality, optimality, problems First-order optimization algorithms and techniques Large-scale optimization (SGD and friends) Directions in non-convex optimization Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 3 / 22

Iteration complexity: smooth problems Assumption: g convex and L-smooth on R^d Gradient descent: θ_t = θ_{t-1} - γ_t g'(θ_{t-1}) Rates: g(θ_t) - g(θ*) ≤ O(1/t) for convex g; g(θ_t) - g(θ*) ≤ O(e^{-t(µ/L)}) = O(e^{-t/κ}) if g is µ-strongly convex, with condition number κ = L/µ (fast when κ is small, slow when κ is large) Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 4 / 22
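To make the rates concrete, here is a minimal NumPy sketch (not from the slides) of gradient descent with the constant step size 1/L on an assumed ridge-regularized least-squares objective, which is µ-strongly convex and L-smooth; the data, µ, and iteration count are illustrative choices.

    import numpy as np

    # Illustrative sketch: gradient descent on g(theta) = (1/2n)||X theta - y||^2 + (mu/2)||theta||^2,
    # which is mu-strongly convex and L-smooth with L = ||X^T X||_2 / n + mu.
    rng = np.random.default_rng(0)
    n, d, mu = 1000, 20, 0.1
    X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
    L = np.linalg.norm(X.T @ X, 2) / n + mu

    def grad(theta):
        return X.T @ (X @ theta - y) / n + mu * theta

    theta = np.zeros(d)
    for t in range(200):
        theta -= (1.0 / L) * grad(theta)  # constant step 1/L: the O(e^{-t/kappa}) regime

    print("kappa =", L / mu, " gradient norm:", np.linalg.norm(grad(theta)))

Each iteration touches all n examples, which is where the factor n in the O(nd κ log(1/ε)) complexity quoted on the next slide comes from.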

Iteration complexity: smooth problems Assumption: g convex and L-smooth on R^d Gradient descent: θ_t = θ_{t-1} - γ_t g'(θ_{t-1}) O(1/t) convergence rate for convex functions; O(e^{-t/κ}) if strongly convex, i.e., complexity O(nd κ log(1/ε)) for a finite sum of n terms Key insights for ML (Bottou and Bousquet, 2008): 1. No need to optimize below statistical error. 2. Cost functions are averages. 3. Testing error is more important than training error. Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 5 / 22

Stochastic gradient descent for finite sums min_{θ ∈ R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ) Iteration: θ_t = θ_{t-1} - γ_t f'_{i(t)}(θ_{t-1}) Sampling with replacement: i(t) ~ Unif({1, ..., n}) Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u Convergence rate if each f_i is convex and L-smooth and g is µ-strongly convex: E[g(θ̄_t) - g(θ*)] ≤ O(1/√t) if γ_t = 1/(L√t), and ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt) Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 6 / 22
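A minimal NumPy sketch of this iteration (assumed ridge least-squares terms f_i; the data and the 1/(L√t) step-size schedule are illustrative choices; the 1/(µt) schedule from the slide gives the O(κ/t) rate instead):

    import numpy as np

    # SGD with sampling with replacement and Polyak-Ruppert averaging on
    # g(theta) = (1/n) sum_i f_i(theta), f_i(theta) = (1/2)(x_i^T theta - y_i)^2 + (mu/2)||theta||^2.
    rng = np.random.default_rng(0)
    n, d, mu = 1000, 20, 0.1
    X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
    L = (np.sum(X**2, axis=1) + mu).max()  # smoothness constant of the individual f_i

    def grad_i(theta, i):
        return (X[i] @ theta - y[i]) * X[i] + mu * theta

    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(1, 20 * n + 1):
        i = rng.integers(n)                   # i(t) ~ Unif({1, ..., n})
        gamma_t = 1.0 / (L * t ** 0.5)        # robust 1/(L sqrt(t)) schedule from the slide
        theta -= gamma_t * grad_i(theta, i)
        theta_bar += (theta - theta_bar) / t  # running Polyak-Ruppert average

    print("averaged iterate (first 3 coords):", theta_bar[:3])

Each step costs O(d) regardless of n, which is the trade-off against batch gradient descent discussed on the next slide.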

Stochastic vs. deterministic methods (strongly convex) Min g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ) Batch gradient descent: θ_t = θ_{t-1} - γ_t g'(θ_{t-1}) = θ_{t-1} - (γ_t/n) Σ_{i=1}^n f'_i(θ_{t-1}); linear (e.g., exponential) convergence rate in O(e^{-t/κ}); iteration complexity is linear in n. Stochastic gradient descent: θ_t = θ_{t-1} - γ_t f'_{i(t)}(θ_{t-1}), with i(t) a random element of {1, ..., n} sampled with replacement; convergence rate in O(κ/t); iteration complexity is independent of n. [Figure: GD takes few, expensive, smooth steps toward the optimum; SGD takes many cheap, noisy steps.] Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 7 / 22

Stochastic vs. deterministic methods Goal = best of both worlds: linear rate with O(d) iteration cost, and a simple choice of step size [Figure: log(excess cost) vs. time; the deterministic method decreases linearly but with expensive steps, the stochastic method has cheap steps but a sublinear rate, and the new hybrid methods combine the linear rate with the O(d) step cost] Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 8-9 / 22

Linearly convergent stochastic gradient algorithms Many related algorithms: SAG (Le Roux, Schmidt, and Bach, 2012), SDCA (Shalev-Shwartz and Zhang, 2013), SVRG (Johnson and Zhang, 2013; Zhang et al., 2013), MISO (Mairal, 2015), Finito (Defazio et al., 2014b), SAGA (Defazio, Bach, and Lacoste-Julien, 2014a) Similar rates of convergence and iterations; different interpretations and proofs / proof lengths: lazy gradient evaluations vs. variance reduction Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 10 / 22

Running-time comparisons (strongly convex) Assumptions: g(θ) = (1/n) Σ_{i=1}^n f_i(θ), each f_i convex and L-smooth, g µ-strongly convex
Stochastic gradient descent:    d × (L/µ) × (1/ε)
Gradient descent:               d × n × (L/µ) × log(1/ε)
Accelerated gradient descent:   d × n × √(L/µ) × log(1/ε)
SAG/SVRG:                       d × (n + L/µ) × log(1/ε)
Beating two lower bounds (Nemirovski and Yudin, 1983; Nesterov, 2004), thanks to additional assumptions: (1) stochastic gradient: exponential rate for finite sums; (2) full gradient: better exponential rate using the sum structure Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 11 / 22
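A quick back-of-the-envelope comparison of the four estimates (illustrative values of n, d, κ = L/µ, and ε, not from the slides), showing that the SAG/SVRG bound dominates when both n and κ are large:

    import math

    # Plug sample values into the running-time estimates above (constants ignored).
    n, d, kappa, eps = 10**6, 100, 10**4, 1e-6
    log_term = math.log(1.0 / eps)

    costs = {
        "SGD":      d * kappa / eps,
        "GD":       d * n * kappa * log_term,
        "AGD":      d * n * math.sqrt(kappa) * log_term,
        "SAG/SVRG": d * (n + kappa) * log_term,
    }
    for name, c in sorted(costs.items(), key=lambda kv: kv[1]):
        print(f"{name:8s} ~ {c:.2e}")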

Running-time comparisons (non-strongly convex) Assumptions: g(θ) = (1/n) Σ_{i=1}^n f_i(θ), each f_i convex and L-smooth Ill-conditioned problems: g may not be strongly convex
Stochastic gradient descent:    d × 1/ε²
Gradient descent:               d × n/ε
Accelerated gradient descent:   d × n/√ε
SAG/SVRG:                       d × √n/ε
Adaptivity to potentially hidden strong convexity: no need to know the local/global strong-convexity constant Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 12 / 22

Experimental results (logistic regression) quantum dataset (n = 50 000, d = 78) and rcv1 dataset (n = 697 641, d = 47 236) [Figures: training objective vs. effective passes for AFG, ASG, IAG, SG, and L-BFGS on both datasets] Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 13 / 22

Experimental results (logistic regression) quantum dataset (n = 50 000, d = 78) and rcv1 dataset (n = 697 641, d = 47 236) [Figures: same comparison with SAG-LS (SAG with line search) added to AFG, ASG, IAG, SG, and L-BFGS] Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 14 / 22

Key idea: variance reduction Principle: reduce the variance of a sample of X by using a sample of another random variable Y with known expectation: Z_α = α(X - Y) + E[Y] Then E[Z_α] = αE[X] + (1 - α)E[Y] and var(Z_α) = α²[var(X) + var(Y) - 2 cov(X, Y)] α = 1: no bias; α < 1: potential bias (but reduced variance); useful if Y is positively correlated with X Application to gradient estimation (Johnson and Zhang, 2013; Zhang, Mahdavi, and Jin, 2013) SVRG: X = f'_{i(t)}(θ_{t-1}), Y = f'_{i(t)}(θ̃), α = 1, with θ̃ stored; E[Y] = (1/n) Σ_{i=1}^n f'_i(θ̃) is the full gradient at θ̃; X - Y = f'_{i(t)}(θ_{t-1}) - f'_{i(t)}(θ̃) Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 15 / 22
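A tiny numerical illustration of this principle (toy Gaussian variables chosen for the example, not from the slides): Y is positively correlated with X and has known mean, so Z_1 = X - Y + E[Y] keeps the mean of X but has much smaller variance.

    import numpy as np

    # Control-variate demo: estimate E[X] using a correlated Y with known E[Y] = 0.
    rng = np.random.default_rng(0)
    base = rng.standard_normal(100_000)
    X = base + 0.3 * rng.standard_normal(100_000)  # quantity of interest
    Y = base                                       # correlated control variate, E[Y] = 0
    Z = X - Y + 0.0                                # alpha = 1: unbiased, reduced variance

    print("mean X:", X.mean(), " mean Z:", Z.mean())
    print("var  X:", X.var(),  " var  Z:", Z.var())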

Stochastic variance reduced gradient (SVRG)
Initialize θ̃ ∈ R^d
For i_epoch = 1 to # of epochs:
    Compute all gradients f'_i(θ̃); store g'(θ̃) = (1/n) Σ_{i=1}^n f'_i(θ̃)
    Initialize θ_0 = θ̃
    For t = 1 to length of epoch:
        θ_t = θ_{t-1} - γ [ g'(θ̃) + ( f'_{i(t)}(θ_{t-1}) - f'_{i(t)}(θ̃) ) ]
    Update θ̃ = θ_t
Output: θ̃
Two gradient evaluations per inner step; no need to store gradients (SAG needs storage) Two parameters: length of epochs + step-size γ Same linear convergence rate as SAG, simpler proof Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 16 / 22
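A minimal NumPy sketch of this loop on an assumed ridge least-squares sum (the objective, epoch length, step size, and number of epochs are illustrative choices, not prescribed by the slide):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, mu = 1000, 20, 0.1
    X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

    def grad_i(theta, i):   # gradient of one term f_i
        return (X[i] @ theta - y[i]) * X[i] + mu * theta

    def full_grad(theta):   # g'(theta) = (1/n) sum_i f_i'(theta)
        return X.T @ (X @ theta - y) / n + mu * theta

    L_max = (np.sum(X**2, axis=1) + mu).max()      # smoothness of the individual f_i
    gamma, epoch_len = 1.0 / (10 * L_max), 2 * n   # the two parameters of SVRG

    theta_tilde = np.zeros(d)
    for epoch in range(30):
        g_tilde = full_grad(theta_tilde)   # snapshot full gradient, once per epoch
        theta = theta_tilde.copy()
        for t in range(epoch_len):
            i = rng.integers(n)
            # variance-reduced estimate: two per-example gradient evaluations, nothing stored
            v = g_tilde + grad_i(theta, i) - grad_i(theta_tilde, i)
            theta -= gamma * v
        theta_tilde = theta                # update the snapshot
    print("final full-gradient norm:", np.linalg.norm(full_grad(theta_tilde)))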

SVRG vs. SAGA
SAGA update: θ_t = θ_{t-1} - γ [ (1/n) Σ_{i=1}^n y_i^{t-1} + ( f'_{i(t)}(θ_{t-1}) - y_{i(t)}^{t-1} ) ]
SVRG update: θ_t = θ_{t-1} - γ [ (1/n) Σ_{i=1}^n f'_i(θ̃) + ( f'_{i(t)}(θ_{t-1}) - f'_{i(t)}(θ̃) ) ]
                                   SAGA         SVRG
Storage of gradients               yes          no
Epoch-based                        no           yes
Parameters                         step-size    step-size & epoch lengths
Gradient evaluations per step      1            at least 2
Adaptivity to strong-convexity     yes          no
Robustness to ill-conditioning     yes          no
Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 17 / 22
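For contrast with the SVRG sketch above, a minimal sketch of the SAGA update with its table of stored per-example gradients y_i (same assumed ridge least-squares problem; the 1/(3L) step size is a common choice in SAGA-style analyses, used here purely as an illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, mu = 1000, 20, 0.1
    X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

    def grad_i(theta, i):
        return (X[i] @ theta - y[i]) * X[i] + mu * theta

    L_max = (np.sum(X**2, axis=1) + mu).max()
    gamma = 1.0 / (3 * L_max)

    theta = np.zeros(d)
    table = np.zeros((n, d))          # stored gradients y_i (O(nd) here; O(n) for linear models)
    table_mean = table.mean(axis=0)
    for t in range(20 * n):
        i = rng.integers(n)
        g_new = grad_i(theta, i)
        v = table_mean + (g_new - table[i])   # SAGA estimate: one new gradient per step
        theta -= gamma * v
        table_mean += (g_new - table[i]) / n  # keep the running mean in sync
        table[i] = g_new
    print("final iterate norm:", np.linalg.norm(theta))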

Proximal extensions Composite optimization problems: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) + h(θ), with each f_i smooth and convex, and h convex, potentially non-smooth Examples: constrained optimization (h an indicator function); sparsity-inducing norms, e.g., h(θ) = ||θ||_1 Proximal methods (a.k.a. splitting methods): projection / soft-thresholding step after the gradient update; see, e.g., Combettes and Pesquet (2011); Bach, Jenatton, Mairal, and Obozinski (2012); Parikh and Boyd (2014) Directly extends to variance-reduced gradient techniques, with the same rates of convergence Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 18 / 22
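As a concrete instance of the proximal step (a sketch under the assumption h(θ) = λ||θ||_1, whose proximal operator is soft-thresholding; the numbers and variable names are illustrative):

    import numpy as np

    def prox_l1(v, thresh):
        # proximal operator of thresh * ||.||_1: componentwise soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

    # One proximal gradient step: move along a gradient estimate v of the smooth part
    # (e.g., the SVRG estimate above), then apply the prox of gamma * h.
    gamma, lam = 0.1, 0.05
    theta = np.array([0.3, -0.002, 0.0, 1.2])
    v = np.array([0.5, -0.1, 0.02, -0.3])        # some gradient estimate of the smooth part
    theta = prox_l1(theta - gamma * v, gamma * lam)
    print(theta)   # coordinates within the threshold are set exactly to zero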

SGD minimizes the testing cost! Goal: minimize g(θ) = E_{p(x,y)} ℓ(y, θᵀΦ(x)) Given n independent samples (x_i, y_i), i = 1, ..., n, from p(x, y), and a single pass of stochastic gradient descent, we get bounds on the excess testing cost E[g(θ̄_n)] - inf_{θ ∈ R^d} g(θ) Optimal convergence rates: O(1/√n) and O(1/(nµ)) Optimal even for non-smooth problems (Nemirovski and Yudin, 1983) Attained by averaged SGD with decaying step-sizes Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 19 / 22

References R. Babanezhad, M. O. Ahmed, A. Virani, M. W. Schmidt, J. Konečný, and S. Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems (NIPS), 2015. F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012. L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008. P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185-212. Springer, 2011. A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a. A. Defazio, J. Domke, and T. S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Proc. ICML, 2014b. Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 20 / 22

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013. N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012. J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829-855, 2015. A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983. Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004. N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014. Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 21 / 22

S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013. L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013. Suvrit Sra (suvrit@mit.edu) Optimization for Machine Learning 22 / 22