Large Scale Machine Learning with Stochastic Gradient Descent

Similar documents
Large Scale Machine Learning and Stochastic Algorithms

Numerical Optimization Techniques

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Computational and Statistical Learning Theory

CS60021: Scalable Data Mining. Large Scale Machine Learning

Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Large-scale Stochastic Optimization

Stochastic optimization in Hilbert spaces

Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines

Stochastic Gradient Descent. CS 584: Big Data Analytics

Kernelized Perceptron Support Vector Machines

STA141C: Big Data & High Performance Statistical Computing

Linear Regression (continued)

Stochastic Optimization Algorithms Beyond SG

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families

A Distributed Solver for Kernelized SVM

Day 3 Lecture 3. Optimizing deep networks

Reading Group on Deep Learning Session 1

More Optimization. Optimization Methods. Methods

Classification and Pattern Recognition

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines

Logistic Regression. Stochastic Gradient Descent

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Overfitting, Bias / Variance Analysis

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Warm up: risk prediction with logistic regression

Discriminative Models

Accelerating Stochastic Optimization

Big Data Analytics. Lucas Rego Drumond

The perceptron learning algorithm is one of the first procedures proposed for learning in neural network models and is mostly credited to Rosenblatt.

Big Data Analytics: Optimization and Randomization

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Linearly-Convergent Stochastic-Gradient Methods

Quasi-Newton Methods. Javier Peña Convex Optimization /36-725

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Beyond stochastic gradient descent for large-scale machine learning

Linear discriminant functions

Trading Convexity for Scalability

Support vector machines Lecture 4

Lecture 5: Logistic Regression. Neural Networks

ECS171: Machine Learning

Optimization Methods for Machine Learning

Discriminative Models

Training Support Vector Machines: Status and Challenges

ECS289: Scalable Machine Learning

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Introduction to Logistic Regression and Support Vector Machine

Machine Learning for NLP

Sequential and reinforcement learning: Stochastic Optimization I

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data

Stochastic Quasi-Newton Methods

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Logistic Regression. COMP 527 Danushka Bollegala

Variable Metric Stochastic Approximation Theory

Machine Learning Lecture 6 Note

Inverse Time Dependency in Convex Regularized Learning

Machine Learning Lecture 7

Convex Optimization Lecture 16

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Neural Networks (Part 1) Goals for the lecture

A Parallel SGD method with Strong Convergence

Neural Network Training

Supervised Learning Part I

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Full-information Online Learning

Stochastic Gradient Descent

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Multi-Label Informed Latent Semantic Indexing

Large Scale Semi-supervised Linear SVMs. University of Chicago

Beyond stochastic gradient descent for large-scale machine learning

Large-scale machine learning Stochastic gradient descent

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Classification Logistic Regression

CSC321 Lecture 2: Linear Regression

Stochastic gradient methods for machine learning

Coordinate Descent and Ascent Methods

Numerical Optimization

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

DATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane

Support Vector and Kernel Methods

Stochastic gradient descent on Riemannian manifolds

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Links between Perceptrons, MLPs and SVMs

Ad Placement Strategies

Logistic Regression. Machine Learning Fall 2018

Large-scale Linear RankSVM

Linear Discrimination Functions

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Statistical Machine Learning from Data

Quasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization

CSCI567 Machine Learning (Fall 2018)

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

Modern Optimization Techniques

MLCC 2017 Regularization Networks I: Linear Models

Simple Neural Nets For Pattern Classification

Transcription:

Large Scale Machine Learning with Stochastic Gradient Descent Léon Bottou leon@bottou.org Microsoft (since June)

Summary i. Learning with Stochastic Gradient Descent. ii. The Tradeoffs of Large Scale Learning. iii. Asymptotic Analysis. iv. Learning with a Single Pass. Léon Bottou 2/37

I. Learning with Stochastic Gradient Descent Léon Bottou 3/37

Example Binary classification Patterns x. Classes y = ±1. Linear model Choose features: Φ(x) R d Linear discriminant function: f w (x) = sign ( ) w Φ(x) Léon Bottou 4/37

SVM training Choose loss function Q(x, y, w) = l(y, f w (x)) = (e.g.) log (1 ) + e y w Φ(x) Cannot minimize the expected risk E(w) = Q(x, y, w) dp (x, y). Can compute the empirical risk E n (w) = 1 n n i=1 Q(x i, y i, w). Minimize L 2 regularized empirical risk min w λ 2 w 2 + 1 n n i=1 Q(x i, y i, w) Choosing λ is the same setting a constraint w 2 < B. Léon Bottou 5/37

Batch versus Online Batch: process all examples together (GD) Example: minimization by gradient descent Repeat: w w γ λw + 1 n Q n w (x i, y i, w) i=1 Online: process examples one by one (SGD) Example: minimization by stochastic gradient descent Repeat: (a) Pick random example x t, y ( t (b) w w γ t λw + Q ) w (x t, y t, w) Léon Bottou 6/37

Second order optimization Batch: (2GD) Example: Newton s algorithm Repeat: w w H 1 λw + 1 n n i=1 Q w (x i, y i, w) Online: (2SGD) Example: Second order stochastic gradient descent Repeat: (a) Pick random example x t, y ( t (b) w w γ t H 1 λw + Q ) w (x t, y t, w) Léon Bottou 7/37

More SGD Algorithms Adaline (Widrow and Hoff, 1960) ( Q adaline = 1 2 y w Φ(x) ) 2 Φ(x) R d, y = ±1 Perceptron (Rosenblatt, 1957) Q perceptron = max{0, y w Φ(x)} Φ(x) R d, y = ±1 Multilayer perceptrons (Rumelhart et al., 1986)... SVM (Cortes and Vapnik, 1995)... w w + γ t ( yt w Φ(x t ) ) Φ(x t ) w w + γ t { yt Φ(x t ) if y t w Φ(x t ) 0 0 otherwise Lasso (Tibshirani, 1996) ( Q lasso = λ w 1 + 1 2 y w Φ(x) ) 2 w = (u 1 v 1,..., u d v d ) Φ(x) R d, y R, λ > 0 K-Means (MacQueen, 1967) 1 Q kmeans = min k 2 (z w k) 2 z R d, w 1... w k R d n 1... n k N, initially 0 u i [ u i γ t ( λ (yt w Φ(x t ))Φ i (x t ) )] + v i [ v i γ t ( λ + (yt w t Φ(x t ))Φ i (x t ) )] + with notation [x] + = max{0, x}. k = arg min k (z t w k ) 2 n k n k + 1 w k w k + 1 n k (z t w k ) Léon Bottou 8/37

II. The Tradeoffs of Large Scale Learning Léon Bottou 9/37

The Computational Problem Baseline large-scale learning algorithm Randomly discarding data is the simplest way to handle large datasets. What is the statistical benefit of processing more data? What is the computational cost of processing more data? We need a theory that links Statistics and Computation! 1967: Vapnik s theory does not discuss computation. 1981: Valiant s learnability excludes exponential time algorithms, but (i) polynomial time already too slow, (ii) few actual results. Léon Bottou 10/37

Decomposition of the Error E( f n ) E(f ) = E(f F ) E(f ) Approximation error (E app ) + E(f n ) E(f F ) Estimation error (E est) + E( f n ) E(f n ) Optimization error (E opt ) Problem: Choose F, n, and ρ to make this as small as possible, subject to budget constraints { max number of examples n max computing time T Note: choosing λ is the same as choosing F. Léon Bottou 11/37

Small-scale Learning The active budget constraint is the number of examples. To reduce the estimation error, take n as large as the budget allows. To reduce the optimization error to zero, take ρ = 0. We need to adjust the size of F. Estimation error Approximation error Size of F See Structural Risk Minimization (Vapnik 74) and later works. Léon Bottou 12/37

Large-scale Learning The active budget constraint is the computing time. More complicated tradeoffs. The computing time depends on the three variables: F, n, and ρ. Example. If we choose ρ small, we decrease the optimization error. But we must also decrease F and/or n with adverse effects on the estimation and approximation errors. The exact tradeoff depends on the optimization algorithm. We can compare optimization algorithms rigorously. Léon Bottou 13/37

Test Error Test Error versus Learning Time Bayes Limit Computing Time Léon Bottou 14/37

Test Error Test Error versus Learning Time 10,000 examples 100,000 examples 1,000,000 examples Bayes limit Computing Time Vary the number of examples... Léon Bottou 15/37

Test Error Test Error versus Learning Time optimizer a optimizer b optimizer c model I model II model III model IV 10,000 examples 100,000 examples 1,000,000 examples Bayes limit Computing Time Vary the number of examples, the statistical models, the algorithms,... Léon Bottou 16/37

Test Error Test Error versus Learning Time optimizer a optimizer b optimizer c model I model II model III model IV Good Learning Algorithms 10,000 examples 100,000 examples 1,000,000 examples Bayes limit Computing Time Not all combinations are equal. Let s compare the red curve for different optimization algorithms. Léon Bottou 17/37

III. Asymptotic Analysis Léon Bottou 18/37

Asymptotic Analysis E( f n ) E(f ) = E = E app + E est + E opt Asymptotic Analysis All three errors must decrease with comparable rates. Forcing one of the errors to decrease much faster - would require additional computing efforts, - but would not significantly improve the test error. Léon Bottou 19/37

Statistics Asymptotics of the statistical components of the error Thanks to refined uniform convergence arguments ( ) log n α E = E app + E est + E opt E app + + ρ n with exponent 1 2 α 1. Asymptotically effective large scale learning Must choose F, n, and ρ such that E E app E est E opt ( ) log n α ρ. n What about optimization times? Léon Bottou 20/37

Statistics and Computation GD 2GD SGD 2SGD Time per iteration : n n 1 1 Iters to accuracy ρ : log 1 ρ log log 1 ρ Time to accuracy ρ : n log 1 ρ n log log 1 ρ 1 ρ 1 ρ 1 ρ 1 ρ Time to error E : 1 E 1/α log2 1 E 1 E 1/α log 1 E log log 1 E 1 E 1 E 2GD optimizes much faster than GD. SGD optimization speed is catastrophic. SGD learns faster than both GD and 2GD. 2SGD only changes the constants. Léon Bottou 21/37

Experiment: Text Categorization Dataset Reuters RCV1 document corpus. 781,265 training examples, 23,149 testing examples. Task Recognizing documents of category CCAT. 47,152 TF-IDF features. Linear SVM. Same setup as (Joachims, 2006) and (Shalev-Schwartz et al., 2007) using plain SGD. Léon Bottou 22/37

Experiment: Text Categorization Results: Hinge-loss SVM Q(x, y, w) = max{0, 1 yw Φ(x)} λ = 0.0001 Training Time Primal cost Test Error SVMLight 23,642 secs 0.2275 6.02% SVMPerf 66 secs 0.2278 6.03% SGD 1.4 secs 0.2275 6.02% Results: Log-Loss SVM Q(x, y, w) = log(1 + exp( yw Φ(x))) λ = 0.00001 Training Time Primal cost Test Error TRON(LibLinear, ε = 0.01) 30 secs 0.18907 5.68% TRON(LibLinear, ε = 0.001) 44 secs 0.18890 5.70% SGD 2.3 secs 0.18893 5.66% Léon Bottou 23/37

The Wall 0.3 Testing cost 0.2 Training time (secs) 100 SGD 50 TRON (LibLinear) 0.1 0.01 0.001 0.0001 1e 05 1e 06 1e 07 1e 08 1e 09 Optimization accuracy (trainingcost optimaltrainingcost) Léon Bottou 24/37

IV. Learning with a Single Pass Léon Bottou 25/37

Batch and online paths True solution, ONLINE Best generalization. one pass over * examples {z1...zt} w wt w * t w1 Best training set error. BATCH many iterations on examples {z1...zt} Léon Bottou 26/37

Effect of one Additional Example (i) Compare w n w n+1 = arg min w = arg min w E n (f w ) E n+1 (f w ) = arg min w [E n (f w ) + 1n l( f w (x n+1 ), y n+1 ) ] n+1 n En+1 (f ) w E n(f w) w* n+1 w* n Léon Bottou 27/37

Effect of one Additional Example (ii) First Order Calculation wn+1 = w n 1 l ( ) ( ) f wn (x n ), y n 1 n H 1 n+1 + O w n 2 where H n+1 is the empirical Hessian on n + 1 examples. Compare with Second Order Stochastic Gradient Descent w t+1 = w t 1 t H 1 l( f wt (x n ), y n ) w Could they converge with the same speed? C 2 assumptions = Accurate speed estimates. Léon Bottou 28/37

Speed of Scaled Stochastic Gradient Study w t+1 = w t 1 t B t ) l (f wt (x n ),y n w + O ( ) 1 t with Bt B 0, BH I/2. 2 Establish convergence a.s. via quasi-martingales (see Bottou, 1991, 1998). Let U t = H (w t w ) (w t w ). Observe E(f wt ) E(f w ) = tr(u t ) + o(tr(u t )) Derive E t (U t+1 ) = [ I 2BH t + o ( )] 1 t Ut + HBGB t + o ( ) 1 2 t where G is the Fisher matrix. 2 Lemma: study real sequence u t+1 = ( 1 + α t + o( )) 1 t ut + β t + o ( ) 1 2 t. 2 When α > 1 show u t = β 1 α 1 t + o( 1 t) (nasty proof!). When α < 1 show u t t α (up to log factors). Bracket E(tr(U t+1 )) between two such sequences and conclude: ( ) tr(hbgb) 1 1 2λ max BH 1 t +o t E [ E(f wt ) E(f w ) ] tr(hbgb) ( ) 1 1 2λ min BH 1 t +o t Interesting special cases: B = I/λ min H and B = H 1. Léon Bottou 29/37

Asymptotic Efficiency of Second Order SGD. Empirical optima Second-order SGD n E [ E(f w n ) E(f F ) ] = lim E(f wt ) E(f F ) ] t lim n n E[ w w n 2] = lim t E[ w w t 2] t Best solution in F. One Pass of Second Order Stochastic Gradient w n K/n w =w* w 0 = w* 0 Empirical Optima w* n Best training set error. (Fabian, 1973; Murata & Amari, 1998; Bottou & LeCun, 2003). Léon Bottou 30/37

Optimal Learning in One Pass A Single Pass of Second Order Stochastic Gradient generalizes as well as the Empirical Optimum. Experiments on synthetic data Mse* +1e 1 Mse* +1e 2 Mse* +1e 3 Mse* +1e 4 0.366 0.362 0.358 0.354 0.350 0.346 1000 10000 100000 Number of examples 0.342 100 1000 10000 Milliseconds Léon Bottou 31/37

Unfortunate Issues Unfortunate theoretical issue How long to reach the asymptotic regime? One-pass learning speed regime may not be reached in one pass... Unfortunate practical issue Second order SGD is rarely feasible. estimate and store d d matrix H 1. multiply the gradient for each example by this matrix H 1. Léon Bottou 32/37

Solutions Limited storage approximations of H 1 Diagonal Gauss-Newton (Becker and Lecun, 1989) Low rank approximation [olbfgs], (Schraudolph et al., 2007) Diagonal approximation [SGDQN], (Bordes et al., 2009) Averaged stochastic gradient Perform SGD with slowly decreasing gains, e.g. γ t t 0.75. Compute averages w t+1 = t t+1 w t + 1 t w t+1 Same asymptotic speed as 2SGD (Polyak and Juditsky, 92) Can take a while to reach the asymptotic regime. Léon Bottou 33/37

Experiment: ALPHA dataset From the 2008 Pascal Large Scale Learning Challenge. ( 2. Loss: Q(x, y, w) = max{0, 1 y w x}) SGD, SGDQN: γ t = γ 0 (1 + γ 0 λt) 1. ASGD: γ t = γ 0 (1 + γ 0 λt) 0.75 0.40 0.38 SGD SGDQN ASGD 27.0 26.0 SGD SGDQN ASGD Expected risk 0.36 0.34 Test Error (%) 25.0 24.0 23.0 0.32 22.0 0.30 0 1 2 3 4 5 Number of epochs 21.0 0 1 2 3 4 5 Number of epochs ASGD nearly reaches the optimal expected risk after a single pass. Léon Bottou 34/37

Experiment: Conditional Random Field CRF for the CONLL 2000 Chunking task. 1.7M parameters. 107,000 training segments. 5400 5300 5200 5100 5000 4900 4800 4700 4600 4500 4400 Test loss SGD SGDQN ASGD 0 5 10 15 epochs 94 93.8 93.6 93.4 93.2 93 92.8 92.6 92.4 92.2 92 Test FB1 score SGD SGDQN ASGD 0 5 10 15 epochs SGDQN more attractive than ASGD. Training times: 500s (SGD), 150s (ASGD), 75s (SGDQN). Standard LBFGS optimizer needs 72 minutes. Léon Bottou 35/37

V. Conclusions Léon Bottou 36/37

Conclusions Good optimization algorithm good learning algorithm. SGD is a poor optimization algorithm. SGD is a good learning algorithm for large scale problems. SGD variants can learn in a single pass (given enough data) Léon Bottou 37/37