Large Scale Machine Learning and Stochastic Algorithms


1 Large Scale Machine Learning and Stochastic Algorithms Léon Bottou NEC Laboratories America Princeton

2 Machine Learning in the 1980s A Path to Artificial Intelligence? Emulate the cognitive capabilities of humans. Profound, but not immediately obvious, relation with the foundations of statistics. The solution does not involve a human expert, unlike the practice of statistics. Humans learn from abundant and diverse data: vision, sound, speech, text, ... More data than we can handle.

3 Machine Learning in the 2000s A Path to Artificial Intelligence? Still my ultimate goal... Opinions vary... Automated Data Mining: gain competitive advantages by analyzing the masses of data that describe the life of our computerized society. - Placement of web advertisement. - Customer relation management. - Fraud detection. Brute force approach. More data than we can handle.

4 Linear Time Learning Algorithms? The computing resources available for learning do not grow faster than the volume of data. The cost of data mining cannot exceed the revenues. Intelligent animals learn from streaming data. Most machine learning or optimization algorithms demand resources that grow faster than the volume of data. Matrix operations ($n^3$ time for $n^2$ coefficients). Sparse matrix operations are worse.

5 Summary I. Machine Learning Redux. II. Machine Learning and Optimization. III. Stochastic Algorithms.

6 ML Redux: The Experimental Paradigm Variations: k-fold cross-validation, etc. This is the main driver for progress in machine learning.

7 ML Redux: Mathematical Statement (i) Assumption: Examples are drawn independently from an unknown probability distribution $P(x, y)$ that represents the laws of Nature. Loss Function: The function $\ell(\hat{y}, y)$ measures the cost of answering $\hat{y}$ when the true answer is $y$. Expected Risk: We seek the function $f$ that minimizes $$\min_f \; E(f) = \int \ell(f(x), y)\, dP(x, y).$$ Note: The test set error is an approximation of the expected risk.

8 ML Redux: Mathematical Statement (ii) Approximation: It is not feasible to search for $f$ among all functions. Instead, we search for $f^*_F$, the minimizer of the expected risk $E(f)$ within some richly parametrized family of functions $F$. Estimation: It is not feasible to minimize the expectation $E(f)$ because $P(x, y)$ is unknown. Instead, we search for $f_n$, the minimizer of the empirical risk $E_n(f)$, that is, the average loss over the training set examples: $$\min_{f \in F} \; E_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i).$$ In other words, we optimize a surrogate problem!
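To make the surrogate concrete, here is a minimal numpy sketch of the empirical risk $E_n(f)$; the linear model, the squared loss, and the synthetic data are illustrative assumptions, not part of the slide.

```python
import numpy as np

def empirical_risk(w, X, Y, loss):
    """Average loss over the n training examples: E_n(f_w)."""
    predictions = X @ w                      # f_w(x_i) for a linear model f_w(x) = w.x
    return np.mean([loss(p, y) for p, y in zip(predictions, Y)])

# Illustrative squared loss l(yhat, y) = (yhat - y)^2 / 2 and synthetic data.
squared_loss = lambda yhat, y: 0.5 * (yhat - y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # n = 100 examples, d = 5 features
Y = X @ np.ones(5) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, Y, squared_loss))   # E_n at w = 0
```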

9 ML Redux: The Tradeoff $$E(f_n) - E(f^*) = \big(E(f^*_F) - E(f^*)\big) + \big(E(f_n) - E(f^*_F)\big),$$ where the first term is the Approximation Error and the second term is the Estimation Error. [Figure: estimation error and approximation error as functions of the size of $F$.] (Vapnik and Chervonenkis, Ordered risk minimization, 1974). (Vapnik and Chervonenkis, Theorie der Zeichenerkennung, 1979).

10 The Computational Problem (i) Statistical Perspective: It is good to optimize an objective function that ensures a fast estimation rate when the number of examples increases. Optimization Perspective: To efficiently solve large problems, it is preferable to choose an optimization algorithm with strong asymptotic properties, e.g. superlinear. Incorrect Conclusion: To address large-scale learning problems, use a superlinear algorithm to optimize an objective function with a fast estimation rate.

11 The Computational Problem (ii) Baseline large-scale learning algorithm: randomly discarding data is the simplest way to handle large datasets. What are the statistical benefits of processing more data? What is the computational cost of processing more data? We need a theory that joins Statistics and Computation! 1967: Vapnik-Chervonenkis theory does not discuss computation. 1981: Valiant's learnability excludes exponential time algorithms, but (i) polynomial time is already too slow, (ii) few actual results. We propose a new analysis of approximate optimization...

12 Test Error versus Learning Time [Figure: test error as a function of computing time; the Bayes limit marks the lowest achievable error.]

13 Test Error versus Learning Time [Figure: test error versus computing time for 10,000, 100,000 and 1,000,000 examples; all curves approach the Bayes limit.] Vary the number of examples...

14 Test Error versus Learning Time [Figure: test error versus computing time for optimizers a, b, c, models I-IV, and 10,000 / 100,000 / 1,000,000 examples.] Vary the number of examples, the statistical models, the algorithms, ...

15 Test Error versus Learning Time [Figure: the same plot; the lower envelope of the curves corresponds to the good learning algorithms.] Not all combinations are equal.

16 Test Error versus Learning Time [Figure: the same plot with both axes expressed in dollars; the envelope of good learning algorithms suggests the point where we should start working on something else.] Changing the units along the axes...

17 Learning with Approximate Optimization Computing $f_n = \arg\min_{f \in F} E_n(f)$ is often costly. Since we already optimize a surrogate function, why should we compute its optimum $f_n$ exactly? Let's assume our optimizer returns $\tilde{f}_n$ such that $E_n(\tilde{f}_n) < E_n(f_n) + \rho$. For instance, one could stop an iterative optimization algorithm long before its convergence.
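One way to obtain such a ρ-approximate minimizer is sketched below. It assumes (this is not stated on the slide) that $E_n$ is λ-strongly convex, so that $E_n(w) - E_n(w_n^*) \leq \|\nabla E_n(w)\|^2 / (2\lambda)$ gives a stopping certificate; the ridge objective and all constants are illustrative.

```python
import numpy as np

def approx_minimize(grad, w0, lam, lr, rho, max_iter=100_000):
    """Gradient descent stopped once the strong-convexity bound
    E_n(w) - E_n(w_n*) <= ||grad E_n(w)||^2 / (2 lam) certifies accuracy rho."""
    w = w0.copy()
    for _ in range(max_iter):
        g = grad(w)
        if g @ g <= 2.0 * lam * rho:         # rho-suboptimality certificate
            break
        w -= lr * g
    return w

# Hypothetical lam-strongly convex objective: ridge-regularized least squares.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
lam = 0.1
grad = lambda w: X.T @ (X @ w - y) / len(y) + lam * w
w_tilde = approx_minimize(grad, np.zeros(10), lam, lr=0.01, rho=1e-3)
print(np.linalg.norm(grad(w_tilde)))
```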

18 Decomposition of the Error (i) $$E(\tilde{f}_n) - E(f^*) = \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} + \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}} + \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{Optimization error}}$$ Problem: Choose $F$, $n$, and $\rho$ to make this as small as possible, subject to the budget constraints: a maximum number of examples $n$ and a maximum computing time $T$.

19 Decomposition of the Error (ii) Approximation error bound: decreases when $F$ gets larger (approximation theory). Estimation error bound: decreases when $n$ gets larger; increases when $F$ gets larger (Vapnik-Chervonenkis theory). Optimization error bound: increases with $\rho$ (Vapnik-Chervonenkis theory plus tricks). Computing time $T$: decreases with $\rho$; increases with $n$; increases with $F$ (algorithm dependent).

20 Small-scale vs. Large-scale Learning We can give rigorous definitions. Definition 1: We have a small-scale learning problem when the active budget constraint is the number of examples n. Definition 2: We have a large-scale learning problem when the active budget constraint is the computing time T.

21 Small-scale Learning The active budget constraint is the number of examples. To reduce the estimation error, take $n$ as large as the budget allows. To reduce the optimization error to zero, take $\rho = 0$. We need to adjust the size of $F$. [Figure: estimation and approximation errors as functions of the size of $F$.] See Structural Risk Minimization (Vapnik, 1974) and later works.

22 Large-scale Learning The active budget constraint is the computing time. More complicated tradeoffs: the computing time depends on the three variables $F$, $n$, and $\rho$. Example: If we choose $\rho$ small, we decrease the optimization error, but we must also reduce $F$ and/or $n$, with adverse effects on the estimation and approximation errors. The exact tradeoff depends on the optimization algorithm. We can compare optimization algorithms rigorously.

23 Executive Summary [Figure: $\log(\rho)$ versus $\log(T)$, showing the best $\rho$ for each algorithm.] Good optimization algorithm (superlinear): $\rho$ decreases faster than $\exp(-T)$. Mediocre optimization algorithm (linear): $\rho$ decreases like $\exp(-T)$. Extraordinarily poor optimization algorithm: $\rho$ decreases like $1/T$.

24 Asymptotics $$E(\tilde{f}_n) - E(f^*) = \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} + \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}} + \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{Optimization error}}$$ Asymptotic Approach: All three errors must decrease at comparable rates. Forcing one of the errors to decrease much faster costs computing time but does not significantly improve the test error.

25 Asymptotics: Estimation Uniform convergence bounds: $$\text{Estimation error} \leq O\!\left(\left[\frac{d}{n}\log\frac{n}{d}\right]^{\alpha}\right) \quad\text{with}\quad \tfrac{1}{2} \leq \alpha \leq 1.$$ The value $d$ describes the capacity of our system; the simplest capacity measure is the Vapnik-Chervonenkis dimension of $F$. There are in fact three (four?) types of bounds to consider: Classical V-C bounds (pessimistic): $O\!\left(\sqrt{\tfrac{d}{n}}\right)$. Relative V-C bounds in the realizable case: $O\!\left(\tfrac{d}{n}\log\tfrac{n}{d}\right)$. Localized bounds (variance, Tsybakov): $O\!\left(\left[\tfrac{d}{n}\log\tfrac{n}{d}\right]^{\alpha}\right)$. Fast estimation rates: (Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2005; ...).

26 Asymptotics: Estimation + Optimization Uniform convergence arguments give $$\text{Estimation error} + \text{Optimization error} \leq O\!\left(\left[\frac{d}{n}\log\frac{n}{d}\right]^{\alpha} + \rho\right) = \varepsilon.$$ This is true for all three cases of uniform convergence bounds. Scaling laws for $\rho$ when $F$ is fixed: the approximation error is constant. There is no need to choose $\rho$ smaller than $O\!\left(\left[\frac{d}{n}\log\frac{n}{d}\right]^{\alpha}\right)$, and it is not advisable to choose $\rho$ larger than that.

27 ... Approximation + Estimation + Optimization When $F$ is chosen via a $\lambda$-regularized cost: uniform convergence theory provides bounds for simple cases (Massart, 2000; Zhang, 2005; Steinwart et al.; ...). Scaling laws for $n$, $\lambda$ and $\rho$ depend on the optimization algorithm. See (Shalev-Shwartz and Srebro, ICML 2008) for linear SVMs. When $F$ is realistically complicated: large datasets matter because one can use more features and richer models. Bounds for such cases are rarely realistic enough.

28 Analysis of a Simple Case Simple parametric setup: $F$ is fixed; functions $f_w(x)$ are linearly parametrized by $w \in \mathbb{R}^d$. Comparing four iterative optimization algorithms for $E_n(f)$: 1. Gradient descent. 2. Second order gradient descent (Newton). 3. Stochastic gradient descent. 4. Stochastic second order gradient descent.

29 Quantities of Interest Empirical Hessian at the empirical optimum $w_n$: $$H = \frac{\partial^2 E_n}{\partial w^2}(f_{w_n}) = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2 \ell(f_{w_n}(x_i), y_i)}{\partial w^2}.$$ Empirical Fisher Information matrix at the empirical optimum $w_n$: $$G = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{\partial \ell(f_{w_n}(x_i), y_i)}{\partial w}\right)\left(\frac{\partial \ell(f_{w_n}(x_i), y_i)}{\partial w}\right)^{\!\top}.$$ Condition number: We assume that there are $\lambda_{\min}$, $\lambda_{\max}$ and $\nu$ such that $\operatorname{trace}(G H^{-1}) \leq \nu$ and $\operatorname{spectrum}(H) \subset [\lambda_{\min}, \lambda_{\max}]$, and we define the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$.
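The slide only defines these quantities; the sketch below estimates $H$, $G$, $\nu$ and $\kappa$ numerically for a hypothetical $\ell_2$-regularized logistic-loss model (the model, the data and $\lambda$ are assumptions).

```python
import numpy as np

def hessian_and_fisher(w, X, Y, lam):
    """Empirical Hessian H and Fisher matrix G of the regularized
    logistic loss  l = log(1 + exp(-y w.x)) + (lam/2)||w||^2  at w."""
    n, d = X.shape
    H = lam * np.eye(d)
    G = np.zeros((d, d))
    for x, y in zip(X, Y):
        s = 1.0 / (1.0 + np.exp(y * (w @ x)))      # sigma(-y w.x)
        g = -y * s * x + lam * w                   # per-example gradient
        H += s * (1.0 - s) * np.outer(x, x) / n    # per-example Hessian contribution
        G += np.outer(g, g) / n
    return H, G

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
Y = np.sign(X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.3 * rng.normal(size=500))
H, G = hessian_and_fisher(np.zeros(4), X, Y, lam=0.01)
eig = np.linalg.eigvalsh(H)
print("tr(G H^-1) =", np.trace(G @ np.linalg.inv(H)))   # the quantity bounded by nu
print("kappa      =", eig.max() / eig.min())            # condition number
```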

30 Gradient Descent (GD) Iterate: $$w_{t+1} \leftarrow w_t - \eta\, \frac{\partial E_n(f_{w_t})}{\partial w}.$$ Best speed achieved with fixed learning rate $\eta = 1/\lambda_{\max}$ (e.g., Dennis & Schnabel, 1983). Cost per iteration: $O(nd)$. Iterations to reach $\rho$: $O\!\left(\kappa \log\frac{1}{\rho}\right)$. Time to reach accuracy $\rho$: $O\!\left(nd\kappa \log\frac{1}{\rho}\right)$. Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left(\frac{d^2\kappa}{\varepsilon^{1/\alpha}} \log^2\frac{1}{\varepsilon}\right)$. In the last column, $n$ and $\rho$ are chosen to reach $\varepsilon$ as fast as possible. Solve for $\varepsilon$ to find the best error rate achievable in a given time. Remark: this abuses the $O(\cdot)$ notation.
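A minimal sketch of this iteration, assuming a quadratic empirical risk (not specified on the slide) and the fixed gain $\eta = 1/\lambda_{\max}$ quoted above; with $\kappa = 100$ the slow, $\kappa$-dependent convergence is visible.

```python
import numpy as np

def gradient_descent(grad, w0, lam_max, n_iter):
    """Plain gradient descent with the fixed gain eta = 1/lambda_max."""
    w, eta = w0.copy(), 1.0 / lam_max
    for _ in range(n_iter):
        w -= eta * grad(w)                   # w_{t+1} = w_t - eta * dE_n/dw
    return w

# Hypothetical quadratic empirical risk E_n(w) = 0.5 w'Aw - b'w.
A = np.diag([10.0, 1.0, 0.1])                # lambda_max = 10, lambda_min = 0.1
b = np.array([1.0, 1.0, 1.0])
w_star = np.linalg.solve(A, b)
w = gradient_descent(lambda w: A @ w - b, np.zeros(3), lam_max=10.0, n_iter=500)
print(np.linalg.norm(w - w_star))            # progress governed by kappa = 100
```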

31 Second Order Gradient Descent (2GD) Iterate: $$w_{t+1} \leftarrow w_t - H^{-1}\, \frac{\partial E_n(f_{w_t})}{\partial w}.$$ We assume $H^{-1}$ is known in advance. Superlinear optimization speed (e.g., Dennis & Schnabel, 1983). Cost per iteration: $O(d(d+n))$. Iterations to reach $\rho$: $O\!\left(\log\log\frac{1}{\rho}\right)$. Time to reach accuracy $\rho$: $O\!\left(d(d+n)\log\log\frac{1}{\rho}\right)$. Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left(\frac{d^2}{\varepsilon^{1/\alpha}} \log\frac{1}{\varepsilon}\log\log\frac{1}{\varepsilon}\right)$. The optimization speed is much faster. The learning speed only saves the condition number $\kappa$.

32 Stochastic Gradient Descent (SGD) Iterate: draw a random example $(x_t, y_t)$ and update $$w_{t+1} \leftarrow w_t - \eta_t\, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}.$$ Best decreasing gain schedule with $\eta = 1/\lambda_{\min}$ (see Bottou, 1991; Murata, 1998; Bottou & LeCun, 2004). Cost per iteration: $O(d)$. Iterations to reach $\rho$: $\frac{\nu k}{\rho} + o\!\left(\frac{1}{\rho}\right)$ with $1 \leq k \leq \kappa^2$. Time to reach accuracy $\rho$: $O\!\left(\frac{d\nu k}{\rho}\right)$. Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left(\frac{d\nu k}{\varepsilon}\right)$. The optimization speed is catastrophic. The learning speed does not depend on the statistical estimation rate $\alpha$. The learning speed depends on the condition number $\kappa$ but scales very well.
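A minimal plain-SGD sketch with a decreasing gain $\eta_t = \eta/(t + t_0)$; the least-squares loss, the synthetic data and the constants are illustrative assumptions.

```python
import numpy as np

def sgd(per_example_grad, X, Y, w0, eta, t0):
    """Plain SGD with the decreasing gain eta_t = eta / (t + t0),
    visiting the examples in a random order (one pass shown here)."""
    w, t = w0.copy(), 0
    rng = np.random.default_rng(0)
    for i in rng.permutation(len(Y)):
        w -= eta / (t + t0) * per_example_grad(w, X[i], Y[i])
        t += 1
    return w

# Hypothetical least-squares example: l(f_w(x), y) = 0.5 (w.x - y)^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=10000)
g = lambda w, x, y: (w @ x - y) * x
print(sgd(g, X, Y, np.zeros(5), eta=1.0, t0=10))   # close to the true weights
```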

33 Second Order Stochastic Gradient Descent (2SGD) Iterate: draw a random example $(x_t, y_t)$ and update $$w_{t+1} \leftarrow w_t - \frac{1}{t} H^{-1}\, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}.$$ Replace the scalar gain $\eta_t$ by the matrix $\frac{1}{t}H^{-1}$. Cost per iteration: $O(d^2)$. Iterations to reach $\rho$: $\frac{\nu}{\rho} + o\!\left(\frac{1}{\rho}\right)$. Time to reach accuracy $\rho$: $O\!\left(\frac{d^2\nu}{\rho}\right)$. Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left(\frac{d^2\nu}{\varepsilon}\right)$. Each iteration is $d$ times more expensive. The number of iterations is reduced by $\kappa^2$ (or less). Second order only changes the constants.
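The corresponding sketch for 2SGD simply replaces the scalar gain with $(1/t)H^{-1}$; here $H$ is estimated once from the data, and the least-squares setup is again an assumption.

```python
import numpy as np

def second_order_sgd(per_example_grad, H_inv, X, Y, w0):
    """Second-order SGD: the scalar gain is replaced by (1/t) * H^{-1}.
    H_inv is assumed known in advance, as on the slide."""
    w = w0.copy()
    rng = np.random.default_rng(0)
    for t, i in enumerate(rng.permutation(len(Y)), start=1):
        w -= (1.0 / t) * H_inv @ per_example_grad(w, X[i], Y[i])
    return w

# Same hypothetical least-squares setup; H = (1/n) X'X is estimated once.
rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=10000)
H_inv = np.linalg.inv(X.T @ X / len(Y))
g = lambda w, x, y: (w @ x - y) * x
print(second_order_sgd(g, H_inv, X, Y, np.zeros(5)))
```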

34 Benchmarking SGD on Simple Problems The theory suggests that SGD is very competitive, yet many people associate SGD with trouble. SGD is historically associated with back-propagation, and multilayer networks are very hard problems (nonlinear, nonconvex). What is difficult: SGD or the MLP? Try plain SGD on a simple learning problem. The simple programs used below are available for download and are very short.

35 Text Categorization with SVMs Dataset: Reuters RCV1 document corpus; 781,265 training examples, 23,149 testing examples; 47,152 TF-IDF features. Task: recognizing documents of category CCAT. Minimize $$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\lambda}{2}\|w\|^2 + \ell(w^\top x_i + b,\, y_i)\right).$$ Update: $$w \leftarrow w - \eta_t\, g(w_t, x_t, y_t) = w - \eta_t\left(\lambda w + \frac{\partial \ell(w^\top x_t + b,\, y_t)}{\partial w}\right).$$ Same setup as (Shalev-Shwartz et al., 2007) but plain SGD.
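A hedged sketch of this update for the hinge loss; the bias handling, the gain schedule $\eta_t = 1/(\lambda(t+t_0))$, the choice of $t_0$ and the synthetic data are illustrative, not the exact benchmark code.

```python
import numpy as np

def svm_sgd(X, Y, lam, epochs=1, t0=None):
    """Plain SGD for the linear SVM primal: (lam/2)||w||^2 + hinge loss,
    with the decreasing gain eta_t = 1 / (lam * (t + t0))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t0 = n // 10 if t0 is None else t0       # heuristic choice, not from the slide
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = 1.0 / (lam * (t + t0))
            margin = Y[i] * (w @ X[i] + b)
            w *= (1.0 - eta * lam)           # -eta * lam * w   (regularization)
            if margin < 1.0:                 # -eta * dl/dw     (hinge subgradient)
                w += eta * Y[i] * X[i]
                b += eta * Y[i]
            t += 1
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
Y = np.sign(X @ np.ones(10) + 0.1)
w, b = svm_sgd(X, Y, lam=0.01, epochs=2)
print(np.mean(np.sign(X @ w + b) == Y))      # training accuracy of the sketch
```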

36 Text Categorization with SVMs
Results: Linear SVM, $\ell(\hat y, y) = \max\{0,\, 1 - y\hat y\}$:
                                Training Time   Primal cost   Test Error
  SVMLight                      23,642 secs
  SVMPerf                       66 secs
  SGD                           1.4 secs
Results: Log-Loss Classifier, $\ell(\hat y, y) = \log(1 + \exp(-y\hat y))$:
                                Training Time   Primal cost   Test Error
  TRON (LibLinear, ε = 0.01)    30 secs
  TRON (LibLinear, ε = 0.001)   44 secs
  SGD                           2.3 secs

37 The Wall [Figure: testing cost and training time (secs) of SGD and TRON (LibLinear) as functions of the optimization accuracy (training cost minus optimal training cost).]

38 Text Chunking with CRFs Dataset: CoNLL-2000 Chunking Task: segment sentences into syntactically correlated chunks (e.g., noun phrases, verb phrases). 106,978 training segments in 8,936 sentences; 23,852 testing segments in 2,012 sentences. Model: Conditional Random Field (all linear, log-loss). Features are n-grams of words and part-of-speech tags; 1,679,700 parameters. Same setup as (Vishwanathan et al., 2006) but plain SGD.

39 Text Chunking with CRFs
Results:
             Training Time   Primal cost   Test F1 score
  L-BFGS     4335 secs
  SGD        568 secs
Notes: Computing the gradients with the chain rule runs faster than computing them with the forward-backward algorithm. Graph Transformer Networks are nonlinear conditional random fields trained with stochastic gradient descent (Bottou et al., 1997).

40 More SVM Experiments
From: Patrick Haffner. Date: Wednesday. "I have tried on some of our main datasets... I can send you the example, it is so striking! Patrick"
  Dataset       Train size   Number of features   % non-0 features   LIBSVM (SDot)   LLAMA SVM   LLAMA MAXENT   SGDSVM
  Reuters       781K         47K                  0.1%               210,
  Translation   1000K        274K                                    days            47,700      1,105          7
  SuperTag      950K         46K                                     31,
  Voicetone     579K         88K                  0.019%             39,

41 More SVM Experiments From: Olivier Chapelle. Date: Sunday. "you should really run batch with various training set sizes..." [Figure: average test loss versus time (seconds) for batch conjugate gradient on training sets of various sizes (n = 10000, 30000, 100000, ...) and for stochastic gradient on the full set.] Log-loss problem. Batch Conjugate Gradient on various training set sizes; Stochastic Gradient on the full set. Why is SGD near the envelope?

42 Effect of one Additional Example (i) Compare $$w_n^* = \arg\min_w E_n(f_w) \qquad\text{and}\qquad w_{n+1}^* = \arg\min_w E_{n+1}(f_w) = \arg\min_w \left[ E_n(f_w) + \frac{1}{n}\,\ell\big(f_w(x_{n+1}), y_{n+1}\big) \right].$$ [Figure: the curves $\frac{n+1}{n}E_{n+1}(f_w)$ and $E_n(f_w)$ with their respective minima $w_{n+1}^*$ and $w_n^*$.]

43 Effect of one Additional Example (ii) First Order Calculation: $$w_{n+1}^* = w_n^* - \frac{1}{n}\, H_{n+1}^{-1}\, \frac{\partial \ell\big(f_{w_n^*}(x_{n+1}), y_{n+1}\big)}{\partial w} + O\!\left(\frac{1}{n^2}\right),$$ where $H_{n+1}$ is the empirical Hessian on $n+1$ examples. Compare with Second Order Stochastic Gradient Descent: $$w_{t+1} = w_t - \frac{1}{t}\, H^{-1}\, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w}.$$ Could they converge with the same speed? $C^2$ assumptions $\Rightarrow$ accurate speed estimates.

44 Speed of Scaled Stochastic Gradient Study $$w_{t+1} = w_t - \frac{1}{t}\, B_t\, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w} + O\!\left(\frac{1}{t^2}\right) \quad\text{with}\quad B_t \to B \succ 0,\; BH \succ \tfrac{I}{2}.$$ Establish convergence a.s. via quasi-martingales (see Bottou, 1991, 1998). Let $U_t = H\,(w_t - w^*)(w_t - w^*)^\top$ and observe $E(f_{w_t}) - E(f_{w^*}) = \operatorname{tr}(U_t) + o(\operatorname{tr}(U_t))$. Derive $$\mathbb{E}_t(U_{t+1}) = \left[I - \frac{2BH}{t} + o\!\left(\frac{1}{t}\right)\right] U_t + \frac{HBGB}{t^2} + o\!\left(\frac{1}{t^2}\right),$$ where $G$ is the Fisher matrix. Lemma: study the real sequence $u_{t+1} = \left(1 - \frac{\alpha}{t} + o\!\left(\frac{1}{t}\right)\right) u_t + \frac{\beta}{t^2} + o\!\left(\frac{1}{t^2}\right)$. When $\alpha > 1$, show $u_t = \frac{\beta}{\alpha - 1}\,\frac{1}{t} + o\!\left(\frac{1}{t}\right)$ (nasty proof!). When $\alpha < 1$, show $u_t \sim t^{-\alpha}$ (up to log factors). Bracket $\mathbb{E}(\operatorname{tr}(U_{t+1}))$ between two such sequences and conclude: $$\frac{\operatorname{tr}(HBGB)}{2\,\lambda_{\max}(BH) - 1}\,\frac{1}{t} + o\!\left(\frac{1}{t}\right) \;\leq\; \mathbb{E}\big[E(f_{w_t}) - E(f_{w^*})\big] \;\leq\; \frac{\operatorname{tr}(HBGB)}{2\,\lambda_{\min}(BH) - 1}\,\frac{1}{t} + o\!\left(\frac{1}{t}\right).$$ Interesting special cases: $B = I/\lambda_{\min}(H)$ and $B = H^{-1}$. After (Bottou & LeCun, 2003), see also (Fabian, 1973; Murata & Amari, 1998).

45 Chasing the constants pays. Follow the leader. Second-order SGD: $$\lim_{n\to\infty} n\,\mathbb{E}\big[E(f_{w_n^*}) - E(f_F^*)\big] = \lim_{t\to\infty} t\,\mathbb{E}\big[E(f_{w_t}) - E(f_F^*)\big] = \operatorname{tr}(G H^{-1}),$$ $$\lim_{n\to\infty} n\,\mathbb{E}\big[\|w^* - w_n^*\|^2\big] = \lim_{t\to\infty} t\,\mathbb{E}\big[\|w^* - w_t\|^2\big] = \operatorname{tr}(H^{-1} G H^{-1}).$$ [Figure: one pass of second order stochastic gradient, started at $w_0 = w_0^*$, tracks the empirical optima $w_n^*$ within $K/n$ of the best solution $w^* = w_\infty^*$ in $F$, and therefore reaches the best training set error.]

46 Optimal Learning in One Pass Given a large enough training set, a single pass of second order stochastic gradient generalizes as well as the empirical optimum. Experiments on synthetic data. [Figure: mean squared error versus number of examples and versus training time in milliseconds; curves labeled Mse*+1e-1 through Mse*+1e-4.]

47 What matters is the Constant Test set performance decreases like $K/t$. The constant $K$ depends on: second order versus first order optimization; how quickly we reach the asymptotic regime; the number of operations per iteration; parallel programming; the programming language; the programmer's skill. All on an equal footing!

48 Unfortunate Theoretical Issues The dangers of asymptotic analysis! Asymptotic one-pass learning whenever $B_t \to H^{-1}$: $$\mathbb{E}\big[E(f_{w_t}) - E(f_{w^*})\big] = \frac{\operatorname{tr}(G H^{-1})}{t} + o\!\left(\frac{1}{t}\right).$$ What happens when $B_t \to H^{-1}$ slowly? The one-pass learning speed regime may not be reached in one pass...

49 Unfortunate Practical Issues Second Order SGD is not that fast! $$w_{t+1} \leftarrow w_t - \frac{1}{t}\, B_t\, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w}.$$ One must estimate and store the $d \times d$ matrix $H^{-1}$, and multiply the gradient of each example by the matrix $H^{-1}$. Sparsity tricks no longer work because $H^{-1}$ is not sparse.

50 Directions in Stochastic Algorithm Design Limited storage approximations of $H^{-1}$: they reduce the number of epochs but are rarely sufficient for fast one-pass learning. Diagonal approximation (Becker & LeCun, 1989; Bordes & Bottou, 2007). Low rank approximation (e.g., LeCun et al., 1998; Le Roux et al., 2007). Online L-BFGS approximation (Schraudolph, 2007). Exploiting Duality: coordinate ascent in the dual is related to SGD; the dual Hessian is sometimes more amenable to approximation; some successes (Bordes and Bottou); long term perspective uncertain. Averaged SGD: good asymptotic constants (Polyak and Juditsky, 1992); the asymptotic regime is long to settle... possibly fixable (Wei Xu, to appear).

51 Engineering Algorithms - Sparsity The very simple SGD update offers lots of engineering opportunities. Example: Sparse Linear SVM. The update $w \leftarrow w - \eta\left(\lambda w + \frac{\partial \ell(w^\top x_i, y_i)}{\partial w}\right)$ can be performed in two steps: (i) $w \leftarrow w - \eta\, \frac{\partial \ell(w^\top x_i, y_i)}{\partial w}$ (sparse, cheap); (ii) $w \leftarrow w\,(1 - \eta\lambda)$ (not sparse, costly). Solution: perform step (i) for each training example, and perform step (ii) with lower frequency and higher gain, as in the sketch below. General Idea: schedule SGD operations according to their cost!
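A sketch of this scheduling idea on sparse data; the constant gain, the `skip` value and the synthetic sparse matrix are illustrative assumptions (the benchmarked SVMSGD2 uses the decreasing gain of slide 53).

```python
import numpy as np
from scipy.sparse import random as sparse_random

def sparse_svm_sgd(X, Y, lam, eta, skip=16):
    """SGD for a sparse linear SVM: the cheap sparse step (i) runs on every
    example, the dense shrinking step (ii) only once every `skip` examples,
    with its gain scaled by `skip` to compensate."""
    w = np.zeros(X.shape[1])
    count = skip
    for i in range(X.shape[0]):
        x = X.getrow(i)                          # sparse row x_i
        if Y[i] * x.dot(w)[0] < 1.0:             # hinge loss is active
            w[x.indices] += eta * Y[i] * x.data  # step (i): touches non-zeros only
        count -= 1
        if count < 0:
            w *= (1.0 - skip * eta * lam)        # step (ii): dense, done rarely
            count = skip
    return w

rng = np.random.default_rng(0)
X = sparse_random(5000, 10000, density=0.001, format="csr", random_state=0)
Y = rng.choice([-1.0, 1.0], size=5000)           # hypothetical labels
w = sparse_svm_sgd(X, Y, lam=1e-4, eta=0.1)
print(np.count_nonzero(w), "weights touched")
```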

52 Engineering Algorithms SGD-QN (i) SGD-QN - a Stochastic Gradient SVM. Official winner of the Pascal Large Scale Learning Challenge. Diagonal Hessian Approximation: gains on the condition number at the lowest computational cost. Schedule of SGD operations according to their costs: regularization versus example updates; re-estimation of the diagonal Hessian. Careful engineering makes it run faster (although we did not get any points for that). Design Principle: start with simple SGD and incrementally improve the constants. (Bordes, Bottou, Gallinari, 2008)

53 Engineering Algorithms SGD-QN (ii)

SVMSGD0
Require: λ, w_0, t_0, T
  t = 0
  while t ≤ T do
    w_{t+1} = w_t − (λ(t+t_0))^{-1} (λ w_t + ℓ'(y_t w_t^T x_t) y_t x_t)
    t = t + 1
  end while
  return w_T

SVMSGD2
Require: λ, w_0, t_0, T, skip
  t = 0, count = skip
  while t ≤ T do
    w_{t+1} = w_t − (λ(t+t_0))^{-1} ℓ'(y_t w_t^T x_t) y_t x_t
    count = count − 1
    if count < 0 then
      w_{t+1} = w_{t+1} − skip (t+t_0)^{-1} w_{t+1}
      count = skip
    end if
    t = t + 1
  end while
  return w_T

54

SVMSGD2
Require: λ, w_0, t_0, T, skip
  t = 0, count = skip
  while t ≤ T do
    w_{t+1} = w_t − (λ(t+t_0))^{-1} ℓ'(y_t w_t^T x_t) y_t x_t
    count = count − 1
    if count < 0 then
      w_{t+1} = w_{t+1} − skip (t+t_0)^{-1} w_{t+1}
      count = skip
    end if
    t = t + 1
  end while
  return w_T

SGD-QN
Require: λ, w_0, t_0, T, skip
  t = 0, count = skip, B = λ^{-1} I, updateB = false, r = 0
  while t ≤ T do
    w_{t+1} = w_t − (t+t_0)^{-1} ℓ'(y_t w_t^T x_t) y_t B x_t
    if updateB = true then
      p_t = g_t(w_{t+1}) − g_t(w_t)
      ∀i, B_ii = B_ii + (2/r) ( [w_{t+1} − w_t]_i [p_t]_i^{-1} − B_ii )
      ∀i, B_ii = max(B_ii, 10^{-2} λ^{-1})
      r = r + 1, updateB = false
    end if
    count = count − 1
    if count < 0 then
      w_{t+1} = w_{t+1} − skip (t+t_0)^{-1} λ B w_{t+1}
      count = skip, updateB = true
    end if
    t = t + 1
  end while
  return w_T

55 Engineering Algorithms SGD-QN (iii) [Figure: test errors (in %) of SVMSGD2, SGD-QN, oLBFGS and LibLinear according to the number of epochs (left) and the training time in seconds (right), on the Delta dataset of the Pascal Large Scale Challenge.]

56 SGD in Real Life: A Check Reader. Examples are pairs (image, amount). Field segmentation. Character segmentation. Character recognition. Syntactical interpretation. Define differentiable modules. Pretrain modules with hand-labelled data. Assemble them to form a composite model. Define a global cost function. Train with SGD for a few weeks. (Bottou, LeCun, Bengio, 1997; LeCun, Bottou, et al., 1998) Industrially deployed; it has processed billions of checks since 1996.

57 SGD in Real Life: Sentence Analysis Binary encoded sentence words; words embedded in a low-dimensional space; positional information relative to the chosen predicate for semantic tagging. Five Time Delay Multilayer networks: Part Of Speech Tagging (treebank, split / 23): ERR 2.76%, state of the art 2.75%. Named Entity Recognition (treebank, Stanford NER): F1 88.97%, state of the art 89.31%. Chunking (treebank): F1 92.7% (94.9% w/POS), state of the art 94.4%. Semantic Role Labeling (propbank): WER ~14%, state of the art ~13%. Language Model (wikipedia, 620M examples). (Collobert and Weston, 2007, 2008) [not my work] No hand-tuned parsing tricks. No hand-tuned linguistic features. State-of-the-art accuracies. Analyzes a sentence in 50 milliseconds (instead of seconds). Trains with SGD in about 3 weeks.

58 Conclusions Qualitatively different tradeoffs for small and large scale learning problems. Traditional performance measurements for optimization algorithms do not apply very well to machine learning problems. Stochastic algorithms provide excellent performance, both in theory and in practice. There is room for improvement. Engineering matters!

59 The End

60 Insights on the Gain Schedule (i) Decreasing gains: $$w_{t+1} \leftarrow w_t - \frac{\eta}{t + t_0}\, g(w_t, x_t, y_t),$$ where $g$ denotes the per-example gradient. Asymptotic Theory: if $s = 2\eta\lambda_{\min} < 1$ then slow rate $O(t^{-s})$; if $s = 2\eta\lambda_{\min} > 1$ then faster rate $O\!\left(\frac{s^2}{s-1}\, t^{-1}\right)$. Example: the SVM benchmark. Use $\eta = 1/\lambda$ because $\lambda \leq \lambda_{\min}$. Choose $t_0$ to make sure that the expected initial updates are comparable with the expected size of the weights.

61 Insights on the Gain Schedule (ii) The sample size $n$ does not change the SGD maths! Constant gain: $w_{t+1} \leftarrow w_t - \eta\, g(w_t, x_t, y_t)$. At any moment during training, we can: pick a random subset of examples of moderate size; try various gains $\eta$ on the subsample; pick the gain $\eta$ that most reduces the cost; use it for the next iterations on the full dataset.
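A sketch of this gain-selection procedure; the candidate gains, the subsample size and the least-squares objective are illustrative assumptions.

```python
import numpy as np

def pick_constant_gain(grad, loss, w, X, Y, candidate_etas, subsample=1000):
    """Try several constant gains on a small random subsample and keep the
    one that most reduces the cost, as suggested on the slide."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(Y), size=min(subsample, len(Y)), replace=False)
    best_eta, best_cost = None, np.inf
    for eta in candidate_etas:
        w_try = w.copy()
        for i in idx:                                     # a few SGD steps at gain eta
            w_try -= eta * grad(w_try, X[i], Y[i])
        cost = np.mean([loss(w_try, X[i], Y[i]) for i in idx])
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta

# Hypothetical least-squares setup.
rng = np.random.default_rng(1)
X = rng.normal(size=(50000, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=50000)
grad = lambda w, x, y: (w @ x - y) * x
loss = lambda w, x, y: 0.5 * (w @ x - y) ** 2
eta = pick_constant_gain(grad, loss, np.zeros(5), X, Y, [1e-3, 1e-2, 1e-1])
print("selected gain:", eta)
```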

62 Insights on Stopping Criteria Time to reach accuracy $\rho$: 2SGD needs $\frac{\nu}{\rho} + o\!\left(\frac{1}{\rho}\right)$ iterations, SGD needs $\frac{k\nu}{\rho} + o\!\left(\frac{1}{\rho}\right)$ iterations. The number of epochs needed to reach the same test cost as the full optimization is 1 for 2SGD and $k$ for SGD, with $1 \leq k \leq \kappa^2$. There are many ways to make the constant $k$ smaller: exact second order stochastic gradient descent; approximate second order stochastic gradient descent; simple preconditioning tricks.

63 Insights on Stopping Criteria Early stopping with cross validation Create a validation set by setting some training examples apart. Monitor cost function on the validation set. Stop when it stops decreasing. Early stopping a priori Extract two disjoint subsamples of training data. Train on the first subsample; stop by validating on the second. The number of epochs is an estimate of k. Train by performing that number of epochs on the full set. This is asymptotically correct and gives reasonable results in practice.
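A sketch of the a-priori early-stopping recipe on a synthetic least-squares problem; the SGD epoch routine, the constant gain and the stalling test are illustrative assumptions.

```python
import numpy as np

def sgd_epoch(w, X, Y, eta=0.05):
    """One SGD epoch on a least-squares problem (hypothetical setup)."""
    for x, y in zip(X, Y):
        w = w - eta * (w @ x - y) * x
    return w

def cost(w, X, Y):
    return 0.5 * np.mean((X @ w - Y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))
Y = X @ np.ones(5) + 0.5 * rng.normal(size=4000)
A = (X[:1000], Y[:1000])             # first subsample: train
B = (X[1000:2000], Y[1000:2000])     # second, disjoint subsample: validate

# Train on A, validate on B, and count epochs until the validation cost stalls.
w, best, k = np.zeros(5), np.inf, 0
for epoch in range(1, 51):
    w = sgd_epoch(w, *A)
    c = cost(w, *B)
    if c < best - 1e-6:
        best, k = c, epoch
    else:
        break

# Then perform that number of epochs on the full training set.
w_full = np.zeros(5)
for _ in range(k):
    w_full = sgd_epoch(w_full, X, Y)
print(k, cost(w_full, X, Y))
```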
