Large Scale Machine Learning and Stochastic Algorithms


1 Large Scale Machine Learning and Stochastic Algorithms Léon Bottou NEC Laboratories America Princeton

2 Machine Learning in the 1980s A Path to Artificial Intelligence? Emulate the cognitive capabilities of humans. Profound, but not immediately obvious, relation with the foundations of statistics. The solution does not involve a human expert, unlike the practice of statistics. Humans learn from abundant and diverse data: vision, sound, speech, text, ... More data than we can handle.

3 Machine Learning in the 2000s A Path to Artificial Intelligence? Still my ultimate goal... Opinions vary... Automated Data Mining: gain competitive advantages by analyzing the masses of data that describe the life of our computerized society. - Placement of web advertisement. - Customer relation management. - Fraud detection. Brute force approach. More data than we can handle.

4 Linear Time Learning Algorithms? The computing resources available for learning do not grow faster than the volume of data. The cost of data mining cannot exceed the revenues. Intelligent animals learn from streaming data. Most machine learning or optimization algorithms demand resources that grow faster than the volume of data. Matrix operations ($n^3$ time for $n^2$ coefficients). Sparse matrix operations are worse.

5 Summary I. Machine Learning Redux. II. Machine Learning and Optimization. III. Stochastic Algorithms.

6 ML Redux: The Experimental Paradigm Variations: k-fold cross-validation, etc. This is the main driver for progress in machine learning.

7 ML Redux: Mathematical Statement (i) Assumption: Examples are drawn independently from an unknown probability distribution $P(x, y)$ that represents the laws of Nature. Loss Function: The function $\ell(\hat{y}, y)$ measures the cost of answering $\hat{y}$ when the true answer is $y$. Expected Risk: We seek the function $f$ that minimizes $$\min_f \; E(f) = \int \ell(f(x), y)\, dP(x, y).$$ Note: The test set error is an approximation of the expected risk.

8 ML Redux: Mathematical Statement (ii) Approximation: It is not feasible to search for $f$ among all functions. Instead, we search for $f^*_F$, the minimizer of the expected risk $E(f)$ within some richly parametrized family of functions $F$. Estimation: It is not feasible to minimize the expectation $E(f)$ because $P(x, y)$ is unknown. Instead, we search for $f_n$, the minimizer of the empirical risk $E_n(f)$, that is, the average loss over the training set examples: $$\min_{f \in F} \; E_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i).$$ In other words, we optimize a surrogate problem!
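To make the surrogate concrete, here is a minimal numpy sketch of the empirical risk $E_n(f)$; the linear model, the squared loss, and the synthetic data are illustrative assumptions, not part of the slide.

```python
import numpy as np

def empirical_risk(w, X, Y, loss):
    """Average loss over the n training examples: E_n(f_w)."""
    predictions = X @ w                      # f_w(x_i) for a linear model f_w(x) = w.x
    return np.mean([loss(p, y) for p, y in zip(predictions, Y)])

# Illustrative squared loss l(yhat, y) = (yhat - y)^2 / 2 and synthetic data.
squared_loss = lambda yhat, y: 0.5 * (yhat - y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # n = 100 examples, d = 5 features
Y = X @ np.ones(5) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, Y, squared_loss))   # E_n at w = 0
```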

9 ML Redux: The Tradeoff $$E(f_n) - E(f^*) = \big(E(f^*_F) - E(f^*)\big) + \big(E(f_n) - E(f^*_F)\big),$$ where the first term is the Approximation Error and the second term is the Estimation Error. [Figure: estimation error and approximation error as functions of the size of $F$.] (Vapnik and Chervonenkis, Ordered risk minimization, 1974). (Vapnik and Chervonenkis, Theorie der Zeichenerkennung, 1979).

10 The Computational Problem (i) Statistical Perspective: It is good to optimize an objective function that ensures a fast estimation rate when the number of examples increases. Optimization Perspective: To efficiently solve large problems, it is preferable to choose an optimization algorithm with strong asymptotic properties, e.g. superlinear. Incorrect Conclusion: To address large-scale learning problems, use a superlinear algorithm to optimize an objective function with a fast estimation rate.

11 The Computational Problem (ii) Baseline large-scale learning algorithm: randomly discarding data is the simplest way to handle large datasets. What are the statistical benefits of processing more data? What is the computational cost of processing more data? We need a theory that joins Statistics and Computation! 1967: Vapnik-Chervonenkis theory does not discuss computation. 1981: Valiant's learnability excludes exponential time algorithms, but (i) polynomial time is already too slow, (ii) few actual results. We propose a new analysis of approximate optimization...

12 Test Error versus Learning Time [Figure: test error as a function of computing time; the Bayes limit marks the lowest achievable error.]

13 Test Error versus Learning Time [Figure: test error versus computing time for 10,000, 100,000 and 1,000,000 examples; all curves approach the Bayes limit.] Vary the number of examples...

14 Test Error versus Learning Time [Figure: test error versus computing time for optimizers a, b, c, models I-IV, and 10,000 / 100,000 / 1,000,000 examples.] Vary the number of examples, the statistical models, the algorithms, ...

15 Test Error versus Learning Time [Figure: the same plot; the lower envelope of the curves corresponds to the good learning algorithms.] Not all combinations are equal.

16 Test Error versus Learning Time [Figure: the same plot with both axes expressed in dollars; the envelope of good learning algorithms suggests the point where we should start working on something else.] Changing the units along the axes...

17 Learning with Approximate Optimization Computing $f_n = \arg\min_{f \in F} E_n(f)$ is often costly. Since we already optimize a surrogate function, why should we compute its optimum $f_n$ exactly? Let's assume our optimizer returns $\tilde{f}_n$ such that $E_n(\tilde{f}_n) < E_n(f_n) + \rho$. For instance, one could stop an iterative optimization algorithm long before its convergence.
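One way to obtain such a ρ-approximate minimizer is sketched below. It assumes (this is not stated on the slide) that $E_n$ is λ-strongly convex, so that $E_n(w) - E_n(w_n^*) \leq \|\nabla E_n(w)\|^2 / (2\lambda)$ gives a stopping certificate; the ridge objective and all constants are illustrative.

```python
import numpy as np

def approx_minimize(grad, w0, lam, lr, rho, max_iter=100_000):
    """Gradient descent stopped once the strong-convexity bound
    E_n(w) - E_n(w_n*) <= ||grad E_n(w)||^2 / (2 lam) certifies accuracy rho."""
    w = w0.copy()
    for _ in range(max_iter):
        g = grad(w)
        if g @ g <= 2.0 * lam * rho:         # rho-suboptimality certificate
            break
        w -= lr * g
    return w

# Hypothetical lam-strongly convex objective: ridge-regularized least squares.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
lam = 0.1
grad = lambda w: X.T @ (X @ w - y) / len(y) + lam * w
w_tilde = approx_minimize(grad, np.zeros(10), lam, lr=0.01, rho=1e-3)
print(np.linalg.norm(grad(w_tilde)))
```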

18 Decomposition of the Error (i) $$E(\tilde{f}_n) - E(f^*) = \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} + \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}} + \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{Optimization error}}$$ Problem: Choose $F$, $n$, and $\rho$ to make this as small as possible, subject to the budget constraints: a maximum number of examples $n$ and a maximum computing time $T$.

19 Decomposition of the Error (ii) Approximation error bound: decreases when $F$ gets larger (approximation theory). Estimation error bound: decreases when $n$ gets larger; increases when $F$ gets larger (Vapnik-Chervonenkis theory). Optimization error bound: increases with $\rho$ (Vapnik-Chervonenkis theory plus tricks). Computing time $T$: decreases with $\rho$; increases with $n$; increases with $F$ (algorithm dependent).

20 Small-scale vs. Large-scale Learning We can give rigorous definitions. Definition 1: We have a small-scale learning problem when the active budget constraint is the number of examples n. Definition 2: We have a large-scale learning problem when the active budget constraint is the computing time T.

21 Small-scale Learning The active budget constraint is the number of examples. To reduce the estimation error, take $n$ as large as the budget allows. To reduce the optimization error to zero, take $\rho = 0$. We need to adjust the size of $F$. [Figure: estimation and approximation errors as functions of the size of $F$.] See Structural Risk Minimization (Vapnik, 1974) and later works.

22 Large-scale Learning The active budget constraint is the computing time. More complicated tradeoffs: the computing time depends on the three variables $F$, $n$, and $\rho$. Example: If we choose $\rho$ small, we decrease the optimization error, but we must also reduce $F$ and/or $n$, with adverse effects on the estimation and approximation errors. The exact tradeoff depends on the optimization algorithm. We can compare optimization algorithms rigorously.

23 Executive Summary [Figure: $\log(\rho)$ versus $\log(T)$, showing the best $\rho$ for each algorithm.] Good optimization algorithm (superlinear): $\rho$ decreases faster than $\exp(-T)$. Mediocre optimization algorithm (linear): $\rho$ decreases like $\exp(-T)$. Extraordinarily poor optimization algorithm: $\rho$ decreases like $1/T$.

24 Asymptotics $$E(\tilde{f}_n) - E(f^*) = \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} + \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}} + \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{Optimization error}}$$ Asymptotic Approach: All three errors must decrease at comparable rates. Forcing one of the errors to decrease much faster costs computing time but does not significantly improve the test error.

25 Asymptotics: Estimation Uniform convergence bounds: $$\text{Estimation error} \leq O\!\left(\left[\frac{d}{n}\log\frac{n}{d}\right]^{\alpha}\right) \quad\text{with}\quad \tfrac{1}{2} \leq \alpha \leq 1.$$ The value $d$ describes the capacity of our system; the simplest capacity measure is the Vapnik-Chervonenkis dimension of $F$. There are in fact three (four?) types of bounds to consider: Classical V-C bounds (pessimistic): $O\!\left(\sqrt{\tfrac{d}{n}}\right)$. Relative V-C bounds in the realizable case: $O\!\left(\tfrac{d}{n}\log\tfrac{n}{d}\right)$. Localized bounds (variance, Tsybakov): $O\!\left(\left[\tfrac{d}{n}\log\tfrac{n}{d}\right]^{\alpha}\right)$. Fast estimation rates: (Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2005; ...).

26 Asymptotics: Estimation + Optimization Uniform convergence arguments give $$\text{Estimation error} + \text{Optimization error} \leq O\!\left(\left[\frac{d}{n}\log\frac{n}{d}\right]^{\alpha} + \rho\right) = \varepsilon.$$ This is true for all three cases of uniform convergence bounds. Scaling laws for $\rho$ when $F$ is fixed: the approximation error is constant. There is no need to choose $\rho$ smaller than $O\!\left(\left[\frac{d}{n}\log\frac{n}{d}\right]^{\alpha}\right)$, and it is not advisable to choose $\rho$ larger than that.

27 ... Approximation + Estimation + Optimization When $F$ is chosen via a $\lambda$-regularized cost: uniform convergence theory provides bounds for simple cases (Massart, 2000; Zhang, 2005; Steinwart et al.; ...). Scaling laws for $n$, $\lambda$ and $\rho$ depend on the optimization algorithm. See (Shalev-Shwartz and Srebro, ICML 2008) for linear SVMs. When $F$ is realistically complicated: large datasets matter because one can use more features and richer models. Bounds for such cases are rarely realistic enough.

28 Analysis of a Simple Case Simple parametric setup: $F$ is fixed; functions $f_w(x)$ are linearly parametrized by $w \in \mathbb{R}^d$. Comparing four iterative optimization algorithms for $E_n(f)$: 1. Gradient descent. 2. Second order gradient descent (Newton). 3. Stochastic gradient descent. 4. Stochastic second order gradient descent.

29 Quantities of Interest Empirical Hessian at the empirical optimum $w_n$: $$H = \frac{\partial^2 E_n}{\partial w^2}(f_{w_n}) = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2 \ell(f_{w_n}(x_i), y_i)}{\partial w^2}.$$ Empirical Fisher Information matrix at the empirical optimum $w_n$: $$G = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{\partial \ell(f_{w_n}(x_i), y_i)}{\partial w}\right)\left(\frac{\partial \ell(f_{w_n}(x_i), y_i)}{\partial w}\right)^{\!\top}.$$ Condition number: We assume that there are $\lambda_{\min}$, $\lambda_{\max}$ and $\nu$ such that $\operatorname{trace}(G H^{-1}) \leq \nu$ and $\operatorname{spectrum}(H) \subset [\lambda_{\min}, \lambda_{\max}]$, and we define the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$.
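The slide only defines these quantities; the sketch below estimates $H$, $G$, $\nu$ and $\kappa$ numerically for a hypothetical $\ell_2$-regularized logistic-loss model (the model, the data and $\lambda$ are assumptions).

```python
import numpy as np

def hessian_and_fisher(w, X, Y, lam):
    """Empirical Hessian H and Fisher matrix G of the regularized
    logistic loss  l = log(1 + exp(-y w.x)) + (lam/2)||w||^2  at w."""
    n, d = X.shape
    H = lam * np.eye(d)
    G = np.zeros((d, d))
    for x, y in zip(X, Y):
        s = 1.0 / (1.0 + np.exp(y * (w @ x)))      # sigma(-y w.x)
        g = -y * s * x + lam * w                   # per-example gradient
        H += s * (1.0 - s) * np.outer(x, x) / n    # per-example Hessian contribution
        G += np.outer(g, g) / n
    return H, G

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
Y = np.sign(X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.3 * rng.normal(size=500))
H, G = hessian_and_fisher(np.zeros(4), X, Y, lam=0.01)
eig = np.linalg.eigvalsh(H)
print("tr(G H^-1) =", np.trace(G @ np.linalg.inv(H)))   # the quantity bounded by nu
print("kappa      =", eig.max() / eig.min())            # condition number
```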

30 Gradient Descent (GD) Iterate: $$w_{t+1} \leftarrow w_t - \eta\, \frac{\partial E_n(f_{w_t})}{\partial w}.$$ Best speed achieved with fixed learning rate $\eta = 1/\lambda_{\max}$ (e.g., Dennis & Schnabel, 1983). Cost per iteration: $O(nd)$. Iterations to reach $\rho$: $O\!\left(\kappa \log\frac{1}{\rho}\right)$. Time to reach accuracy $\rho$: $O\!\left(nd\kappa \log\frac{1}{\rho}\right)$. Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left(\frac{d^2\kappa}{\varepsilon^{1/\alpha}} \log^2\frac{1}{\varepsilon}\right)$. In the last column, $n$ and $\rho$ are chosen to reach $\varepsilon$ as fast as possible. Solve for $\varepsilon$ to find the best error rate achievable in a given time. Remark: this abuses the $O(\cdot)$ notation.
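A minimal sketch of this iteration, assuming a quadratic empirical risk (not specified on the slide) and the fixed gain $\eta = 1/\lambda_{\max}$ quoted above; with $\kappa = 100$ the slow, $\kappa$-dependent convergence is visible.

```python
import numpy as np

def gradient_descent(grad, w0, lam_max, n_iter):
    """Plain gradient descent with the fixed gain eta = 1/lambda_max."""
    w, eta = w0.copy(), 1.0 / lam_max
    for _ in range(n_iter):
        w -= eta * grad(w)                   # w_{t+1} = w_t - eta * dE_n/dw
    return w

# Hypothetical quadratic empirical risk E_n(w) = 0.5 w'Aw - b'w.
A = np.diag([10.0, 1.0, 0.1])                # lambda_max = 10, lambda_min = 0.1
b = np.array([1.0, 1.0, 1.0])
w_star = np.linalg.solve(A, b)
w = gradient_descent(lambda w: A @ w - b, np.zeros(3), lam_max=10.0, n_iter=500)
print(np.linalg.norm(w - w_star))            # progress governed by kappa = 100
```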

31 Second Order Gradient Descent (2GD) Iterate: $$w_{t+1} \leftarrow w_t - H^{-1}\, \frac{\partial E_n(f_{w_t})}{\partial w}.$$ We assume $H^{-1}$ is known in advance. Superlinear optimization speed (e.g., Dennis & Schnabel, 1983). Cost per iteration: $O(d(d+n))$. Iterations to reach $\rho$: $O\!\left(\log\log\frac{1}{\rho}\right)$. Time to reach accuracy $\rho$: $O\!\left(d(d+n)\log\log\frac{1}{\rho}\right)$. Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left(\frac{d^2}{\varepsilon^{1/\alpha}} \log\frac{1}{\varepsilon}\log\log\frac{1}{\varepsilon}\right)$. The optimization speed is much faster. The learning speed only saves the condition number $\kappa$.

32 Stochastic Gradient Descent (SGD) Iterate: draw a random example $(x_t, y_t)$ and update $$w_{t+1} \leftarrow w_t - \eta_t\, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}.$$ Best decreasing gain schedule with $\eta = 1/\lambda_{\min}$ (see Bottou, 1991; Murata, 1998; Bottou & LeCun, 2004). Cost per iteration: $O(d)$. Iterations to reach $\rho$: $\frac{\nu k}{\rho} + o\!\left(\frac{1}{\rho}\right)$ with $1 \leq k \leq \kappa^2$. Time to reach accuracy $\rho$: $O\!\left(\frac{d\nu k}{\rho}\right)$. Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left(\frac{d\nu k}{\varepsilon}\right)$. The optimization speed is catastrophic. The learning speed does not depend on the statistical estimation rate $\alpha$. The learning speed depends on the condition number $\kappa$ but scales very well.
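A minimal plain-SGD sketch with a decreasing gain $\eta_t = \eta/(t + t_0)$; the least-squares loss, the synthetic data and the constants are illustrative assumptions.

```python
import numpy as np

def sgd(per_example_grad, X, Y, w0, eta, t0):
    """Plain SGD with the decreasing gain eta_t = eta / (t + t0),
    visiting the examples in a random order (one pass shown here)."""
    w, t = w0.copy(), 0
    rng = np.random.default_rng(0)
    for i in rng.permutation(len(Y)):
        w -= eta / (t + t0) * per_example_grad(w, X[i], Y[i])
        t += 1
    return w

# Hypothetical least-squares example: l(f_w(x), y) = 0.5 (w.x - y)^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=10000)
g = lambda w, x, y: (w @ x - y) * x
print(sgd(g, X, Y, np.zeros(5), eta=1.0, t0=10))   # close to the true weights
```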

33 Second Order Stochastic Gradient Descent (2SGD) Iterate: draw a random example $(x_t, y_t)$ and update $$w_{t+1} \leftarrow w_t - \frac{1}{t} H^{-1}\, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}.$$ Replace the scalar gain $\eta_t$ by the matrix $\frac{1}{t}H^{-1}$. Cost per iteration: $O(d^2)$. Iterations to reach $\rho$: $\frac{\nu}{\rho} + o\!\left(\frac{1}{\rho}\right)$. Time to reach accuracy $\rho$: $O\!\left(\frac{d^2\nu}{\rho}\right)$. Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left(\frac{d^2\nu}{\varepsilon}\right)$. Each iteration is $d$ times more expensive. The number of iterations is reduced by $\kappa^2$ (or less). Second order only changes the constants.
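The corresponding sketch for 2SGD simply replaces the scalar gain with $(1/t)H^{-1}$; here $H$ is estimated once from the data, and the least-squares setup is again an assumption.

```python
import numpy as np

def second_order_sgd(per_example_grad, H_inv, X, Y, w0):
    """Second-order SGD: the scalar gain is replaced by (1/t) * H^{-1}.
    H_inv is assumed known in advance, as on the slide."""
    w = w0.copy()
    rng = np.random.default_rng(0)
    for t, i in enumerate(rng.permutation(len(Y)), start=1):
        w -= (1.0 / t) * H_inv @ per_example_grad(w, X[i], Y[i])
    return w

# Same hypothetical least-squares setup; H = (1/n) X'X is estimated once.
rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=10000)
H_inv = np.linalg.inv(X.T @ X / len(Y))
g = lambda w, x, y: (w @ x - y) * x
print(second_order_sgd(g, H_inv, X, Y, np.zeros(5)))
```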

34 Benchmarking SGD on Simple Problems The theory suggests that SGD is very competitive, yet many people associate SGD with trouble. SGD is historically associated with back-propagation, and multilayer networks are very hard problems (nonlinear, nonconvex). What is difficult: SGD or the MLP? Try plain SGD on a simple learning problem. The simple programs used below are available for download and are very short.

35 Text Categorization with SVMs Dataset: Reuters RCV1 document corpus; 781,265 training examples, 23,149 testing examples; 47,152 TF-IDF features. Task: recognizing documents of category CCAT. Minimize $$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\lambda}{2}\|w\|^2 + \ell(w^\top x_i + b,\, y_i)\right).$$ Update: $$w \leftarrow w - \eta_t\, g(w_t, x_t, y_t) = w - \eta_t\left(\lambda w + \frac{\partial \ell(w^\top x_t + b,\, y_t)}{\partial w}\right).$$ Same setup as (Shalev-Shwartz et al., 2007) but plain SGD.
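A hedged sketch of this update for the hinge loss; the bias handling, the gain schedule $\eta_t = 1/(\lambda(t+t_0))$, the choice of $t_0$ and the synthetic data are illustrative, not the exact benchmark code.

```python
import numpy as np

def svm_sgd(X, Y, lam, epochs=1, t0=None):
    """Plain SGD for the linear SVM primal: (lam/2)||w||^2 + hinge loss,
    with the decreasing gain eta_t = 1 / (lam * (t + t0))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t0 = n // 10 if t0 is None else t0       # heuristic choice, not from the slide
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = 1.0 / (lam * (t + t0))
            margin = Y[i] * (w @ X[i] + b)
            w *= (1.0 - eta * lam)           # -eta * lam * w   (regularization)
            if margin < 1.0:                 # -eta * dl/dw     (hinge subgradient)
                w += eta * Y[i] * X[i]
                b += eta * Y[i]
            t += 1
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
Y = np.sign(X @ np.ones(10) + 0.1)
w, b = svm_sgd(X, Y, lam=0.01, epochs=2)
print(np.mean(np.sign(X @ w + b) == Y))      # training accuracy of the sketch
```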

36 Text Categorization with SVMs
Results: Linear SVM, $\ell(\hat y, y) = \max\{0,\, 1 - y\hat y\}$:
                                Training Time   Primal cost   Test Error
  SVMLight                      23,642 secs
  SVMPerf                       66 secs
  SGD                           1.4 secs
Results: Log-Loss Classifier, $\ell(\hat y, y) = \log(1 + \exp(-y\hat y))$:
                                Training Time   Primal cost   Test Error
  TRON (LibLinear, ε = 0.01)    30 secs
  TRON (LibLinear, ε = 0.001)   44 secs
  SGD                           2.3 secs

37 The Wall [Figure: testing cost and training time (secs) of SGD and TRON (LibLinear) as functions of the optimization accuracy (training cost minus optimal training cost).]

38 Text Chunking with CRFs Dataset: CoNLL-2000 Chunking Task: segment sentences into syntactically correlated chunks (e.g., noun phrases, verb phrases). 106,978 training segments in 8,936 sentences; 23,852 testing segments in 2,012 sentences. Model: Conditional Random Field (all linear, log-loss). Features are n-grams of words and part-of-speech tags; 1,679,700 parameters. Same setup as (Vishwanathan et al., 2006) but plain SGD.

39 Text Chunking with CRFs
Results:
             Training Time   Primal cost   Test F1 score
  L-BFGS     4335 secs
  SGD        568 secs
Notes: Computing the gradients with the chain rule runs faster than computing them with the forward-backward algorithm. Graph Transformer Networks are nonlinear conditional random fields trained with stochastic gradient descent (Bottou et al., 1997).

40 More SVM Experiments
From: Patrick Haffner. Date: Wednesday. "I have tried on some of our main datasets... I can send you the example, it is so striking! Patrick"
  Dataset       Train size   Number of features   % non-0 features   LIBSVM (SDot)   LLAMA SVM   LLAMA MAXENT   SGDSVM
  Reuters       781K         47K                  0.1%               210,
  Translation   1000K        274K                                    days            47,700      1,105          7
  SuperTag      950K         46K                                     31,
  Voicetone     579K         88K                  0.019%             39,

41 More SVM Experiments From: Olivier Chapelle. Date: Sunday. "you should really run batch with various training set sizes..." [Figure: average test loss versus time (seconds) for batch conjugate gradient on training sets of various sizes (n = 10000, 30000, 100000, ...) and for stochastic gradient on the full set.] Log-loss problem. Batch Conjugate Gradient on various training set sizes; Stochastic Gradient on the full set. Why is SGD near the envelope?

42 Effect of one Additional Example (i) Compare $$w_n^* = \arg\min_w E_n(f_w) \qquad\text{and}\qquad w_{n+1}^* = \arg\min_w E_{n+1}(f_w) = \arg\min_w \left[ E_n(f_w) + \frac{1}{n}\,\ell\big(f_w(x_{n+1}), y_{n+1}\big) \right].$$ [Figure: the curves $\frac{n+1}{n}E_{n+1}(f_w)$ and $E_n(f_w)$ with their respective minima $w_{n+1}^*$ and $w_n^*$.]

43 Effect of one Additional Example (ii) First Order Calculation: $$w_{n+1}^* = w_n^* - \frac{1}{n}\, H_{n+1}^{-1}\, \frac{\partial \ell\big(f_{w_n^*}(x_{n+1}), y_{n+1}\big)}{\partial w} + O\!\left(\frac{1}{n^2}\right),$$ where $H_{n+1}$ is the empirical Hessian on $n+1$ examples. Compare with Second Order Stochastic Gradient Descent: $$w_{t+1} = w_t - \frac{1}{t}\, H^{-1}\, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w}.$$ Could they converge with the same speed? $C^2$ assumptions $\Rightarrow$ accurate speed estimates.

44 Speed of Scaled Stochastic Gradient Study $$w_{t+1} = w_t - \frac{1}{t}\, B_t\, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w} + O\!\left(\frac{1}{t^2}\right) \quad\text{with}\quad B_t \to B \succ 0,\; BH \succ \tfrac{I}{2}.$$ Establish convergence a.s. via quasi-martingales (see Bottou, 1991, 1998). Let $U_t = H\,(w_t - w^*)(w_t - w^*)^\top$ and observe $E(f_{w_t}) - E(f_{w^*}) = \operatorname{tr}(U_t) + o(\operatorname{tr}(U_t))$. Derive $$\mathbb{E}_t(U_{t+1}) = \left[I - \frac{2BH}{t} + o\!\left(\frac{1}{t}\right)\right] U_t + \frac{HBGB}{t^2} + o\!\left(\frac{1}{t^2}\right),$$ where $G$ is the Fisher matrix. Lemma: study the real sequence $u_{t+1} = \left(1 - \frac{\alpha}{t} + o\!\left(\frac{1}{t}\right)\right) u_t + \frac{\beta}{t^2} + o\!\left(\frac{1}{t^2}\right)$. When $\alpha > 1$, show $u_t = \frac{\beta}{\alpha - 1}\,\frac{1}{t} + o\!\left(\frac{1}{t}\right)$ (nasty proof!). When $\alpha < 1$, show $u_t \sim t^{-\alpha}$ (up to log factors). Bracket $\mathbb{E}(\operatorname{tr}(U_{t+1}))$ between two such sequences and conclude: $$\frac{\operatorname{tr}(HBGB)}{2\,\lambda_{\max}(BH) - 1}\,\frac{1}{t} + o\!\left(\frac{1}{t}\right) \;\leq\; \mathbb{E}\big[E(f_{w_t}) - E(f_{w^*})\big] \;\leq\; \frac{\operatorname{tr}(HBGB)}{2\,\lambda_{\min}(BH) - 1}\,\frac{1}{t} + o\!\left(\frac{1}{t}\right).$$ Interesting special cases: $B = I/\lambda_{\min}(H)$ and $B = H^{-1}$. After (Bottou & LeCun, 2003), see also (Fabian, 1973; Murata & Amari, 1998).

45 Chasing the constants pays. Follow the leader. Second-order SGD: $$\lim_{n\to\infty} n\,\mathbb{E}\big[E(f_{w_n^*}) - E(f_F^*)\big] = \lim_{t\to\infty} t\,\mathbb{E}\big[E(f_{w_t}) - E(f_F^*)\big] = \operatorname{tr}(G H^{-1}),$$ $$\lim_{n\to\infty} n\,\mathbb{E}\big[\|w^* - w_n^*\|^2\big] = \lim_{t\to\infty} t\,\mathbb{E}\big[\|w^* - w_t\|^2\big] = \operatorname{tr}(H^{-1} G H^{-1}).$$ [Figure: one pass of second order stochastic gradient, started at $w_0 = w_0^*$, tracks the empirical optima $w_n^*$ within $K/n$ of the best solution $w^* = w_\infty^*$ in $F$, and therefore reaches the best training set error.]

46 Optimal Learning in One Pass Given a large enough training set, a single pass of second order stochastic gradient generalizes as well as the empirical optimum. Experiments on synthetic data. [Figure: mean squared error versus number of examples and versus training time in milliseconds; curves labeled Mse*+1e-1 through Mse*+1e-4.]

47 What matters is the Constant Test set performance decreases like $K/t$. The constant $K$ depends on: second order versus first order optimization; how quickly we reach the asymptotic regime; the number of operations per iteration; parallel programming; the programming language; the programmer's skill. All on an equal footing!

48 Unfortunate Theoretical Issues The dangers of asymptotic analysis! Asymptotic one-pass learning whenever $B_t \to H^{-1}$: $$\mathbb{E}\big[E(f_{w_t}) - E(f_{w^*})\big] = \frac{\operatorname{tr}(G H^{-1})}{t} + o\!\left(\frac{1}{t}\right).$$ What happens when $B_t \to H^{-1}$ slowly? The one-pass learning speed regime may not be reached in one pass...

49 Unfortunate Practical Issues Second Order SGD is not that fast! $$w_{t+1} \leftarrow w_t - \frac{1}{t}\, B_t\, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w}.$$ One must estimate and store the $d \times d$ matrix $H^{-1}$, and multiply the gradient of each example by the matrix $H^{-1}$. Sparsity tricks no longer work because $H^{-1}$ is not sparse.

50 Directions in Stochastic Algorithm Design Limited storage approximations of $H^{-1}$: they reduce the number of epochs but are rarely sufficient for fast one-pass learning. Diagonal approximation (Becker & LeCun, 1989; Bordes & Bottou, 2007). Low rank approximation (e.g., LeCun et al., 1998; Le Roux et al., 2007). Online L-BFGS approximation (Schraudolph, 2007). Exploiting Duality: coordinate ascent in the dual is related to SGD; the dual Hessian is sometimes more amenable to approximation; some successes (Bordes and Bottou); long term perspective uncertain. Averaged SGD: good asymptotic constants (Polyak and Juditsky, 1992); the asymptotic regime is long to settle... possibly fixable (Wei Xu, to appear).

51 Engineering Algorithms - Sparsity The very simple SGD update offers lots of engineering opportunities. Example: Sparse Linear SVM. The update $w \leftarrow w - \eta\left(\lambda w + \frac{\partial \ell(w^\top x_i, y_i)}{\partial w}\right)$ can be performed in two steps: (i) $w \leftarrow w - \eta\, \frac{\partial \ell(w^\top x_i, y_i)}{\partial w}$ (sparse, cheap); (ii) $w \leftarrow w\,(1 - \eta\lambda)$ (not sparse, costly). Solution: perform step (i) for each training example, and perform step (ii) with lower frequency and higher gain, as in the sketch below. General Idea: schedule SGD operations according to their cost!
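A sketch of this scheduling idea on sparse data; the constant gain, the `skip` value and the synthetic sparse matrix are illustrative assumptions (the benchmarked SVMSGD2 uses the decreasing gain of slide 53).

```python
import numpy as np
from scipy.sparse import random as sparse_random

def sparse_svm_sgd(X, Y, lam, eta, skip=16):
    """SGD for a sparse linear SVM: the cheap sparse step (i) runs on every
    example, the dense shrinking step (ii) only once every `skip` examples,
    with its gain scaled by `skip` to compensate."""
    w = np.zeros(X.shape[1])
    count = skip
    for i in range(X.shape[0]):
        x = X.getrow(i)                          # sparse row x_i
        if Y[i] * x.dot(w)[0] < 1.0:             # hinge loss is active
            w[x.indices] += eta * Y[i] * x.data  # step (i): touches non-zeros only
        count -= 1
        if count < 0:
            w *= (1.0 - skip * eta * lam)        # step (ii): dense, done rarely
            count = skip
    return w

rng = np.random.default_rng(0)
X = sparse_random(5000, 10000, density=0.001, format="csr", random_state=0)
Y = rng.choice([-1.0, 1.0], size=5000)           # hypothetical labels
w = sparse_svm_sgd(X, Y, lam=1e-4, eta=0.1)
print(np.count_nonzero(w), "weights touched")
```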

52 Engineering Algorithms SGD-QN (i) SGD-QN - a Stochastic Gradient SVM. Official winner of the Pascal Large Scale Learning Challenge. Diagonal Hessian Approximation: gains on the condition number at the lowest computational cost. Schedule of SGD operations according to their costs: regularization versus example updates; re-estimation of the diagonal Hessian. Careful engineering makes it run faster (although we did not get any points for that). Design Principle: start with simple SGD and incrementally improve the constants. (Bordes, Bottou, Gallinari, 2008)

53 Engineering Algorithms SGD-QN (ii)

SVMSGD0
Require: λ, w_0, t_0, T
  t = 0
  while t ≤ T do
    w_{t+1} = w_t − (λ(t+t_0))^{-1} (λ w_t + ℓ'(y_t w_t^T x_t) y_t x_t)
    t = t + 1
  end while
  return w_T

SVMSGD2
Require: λ, w_0, t_0, T, skip
  t = 0, count = skip
  while t ≤ T do
    w_{t+1} = w_t − (λ(t+t_0))^{-1} ℓ'(y_t w_t^T x_t) y_t x_t
    count = count − 1
    if count < 0 then
      w_{t+1} = w_{t+1} − skip (t+t_0)^{-1} w_{t+1}
      count = skip
    end if
    t = t + 1
  end while
  return w_T

54

SVMSGD2
Require: λ, w_0, t_0, T, skip
  t = 0, count = skip
  while t ≤ T do
    w_{t+1} = w_t − (λ(t+t_0))^{-1} ℓ'(y_t w_t^T x_t) y_t x_t
    count = count − 1
    if count < 0 then
      w_{t+1} = w_{t+1} − skip (t+t_0)^{-1} w_{t+1}
      count = skip
    end if
    t = t + 1
  end while
  return w_T

SGD-QN
Require: λ, w_0, t_0, T, skip
  t = 0, count = skip, B = λ^{-1} I, updateB = false, r = 0
  while t ≤ T do
    w_{t+1} = w_t − (t+t_0)^{-1} ℓ'(y_t w_t^T x_t) y_t B x_t
    if updateB = true then
      p_t = g_t(w_{t+1}) − g_t(w_t)
      ∀i, B_ii = B_ii + (2/r) ( [w_{t+1} − w_t]_i [p_t]_i^{-1} − B_ii )
      ∀i, B_ii = max(B_ii, 10^{-2} λ^{-1})
      r = r + 1, updateB = false
    end if
    count = count − 1
    if count < 0 then
      w_{t+1} = w_{t+1} − skip (t+t_0)^{-1} λ B w_{t+1}
      count = skip, updateB = true
    end if
    t = t + 1
  end while
  return w_T

55 Engineering Algorithms SGD-QN (iii) [Figure: test errors (in %) of SVMSGD2, SGD-QN, oLBFGS and LibLinear according to the number of epochs (left) and the training time in seconds (right), on the Delta dataset of the Pascal Large Scale Challenge.]

56 SGD in Real Life: A Check Reader. Examples are pairs (image, amount). Field segmentation. Character segmentation. Character recognition. Syntactical interpretation. Define differentiable modules. Pretrain modules with hand-labelled data. Assemble them to form a composite model. Define a global cost function. Train with SGD for a few weeks. (Bottou, LeCun, Bengio, 1997; LeCun, Bottou, et al., 1998) Industrially deployed; it has processed billions of checks since 1996.

57 SGD in Real Life: Sentence Analysis Binary encoded sentence words; words embedded in a low-dimensional space; positional information relative to the chosen predicate for semantic tagging. Five Time Delay Multilayer networks: Part Of Speech Tagging (treebank, split / 23): ERR 2.76%, state of the art 2.75%. Named Entity Recognition (treebank, Stanford NER): F1 88.97%, state of the art 89.31%. Chunking (treebank): F1 92.7% (94.9% w/POS), state of the art 94.4%. Semantic Role Labeling (propbank): WER ~14%, state of the art ~13%. Language Model (wikipedia, 620M examples). (Collobert and Weston, 2007, 2008) [not my work] No hand-tuned parsing tricks. No hand-tuned linguistic features. State-of-the-art accuracies. Analyzes a sentence in 50 milliseconds (instead of seconds). Trains with SGD in about 3 weeks.

58 Conclusions Qualitatively different tradeoffs for small and large scale learning problems. Traditional performance measurements for optimization algorithms do not apply very well to machine learning problems. Stochastic algorithms provide excellent performance, both in theory and in practice. There is room for improvement. Engineering matters!

59 The End

60 Insights on the Gain Schedule (i) Decreasing gains: $$w_{t+1} \leftarrow w_t - \frac{\eta}{t + t_0}\, g(w_t, x_t, y_t),$$ where $g$ denotes the per-example gradient. Asymptotic Theory: if $s = 2\eta\lambda_{\min} < 1$ then slow rate $O(t^{-s})$; if $s = 2\eta\lambda_{\min} > 1$ then faster rate $O\!\left(\frac{s^2}{s-1}\, t^{-1}\right)$. Example: the SVM benchmark. Use $\eta = 1/\lambda$ because $\lambda \leq \lambda_{\min}$. Choose $t_0$ to make sure that the expected initial updates are comparable with the expected size of the weights.

61 Insights on the Gain Schedule (ii) The sample size $n$ does not change the SGD maths! Constant gain: $w_{t+1} \leftarrow w_t - \eta\, g(w_t, x_t, y_t)$. At any moment during training, we can: pick a random subset of examples of moderate size; try various gains $\eta$ on the subsample; pick the gain $\eta$ that most reduces the cost; use it for the next iterations on the full dataset.
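A sketch of this gain-selection procedure; the candidate gains, the subsample size and the least-squares objective are illustrative assumptions.

```python
import numpy as np

def pick_constant_gain(grad, loss, w, X, Y, candidate_etas, subsample=1000):
    """Try several constant gains on a small random subsample and keep the
    one that most reduces the cost, as suggested on the slide."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(Y), size=min(subsample, len(Y)), replace=False)
    best_eta, best_cost = None, np.inf
    for eta in candidate_etas:
        w_try = w.copy()
        for i in idx:                                     # a few SGD steps at gain eta
            w_try -= eta * grad(w_try, X[i], Y[i])
        cost = np.mean([loss(w_try, X[i], Y[i]) for i in idx])
        if cost < best_cost:
            best_eta, best_cost = eta, cost
    return best_eta

# Hypothetical least-squares setup.
rng = np.random.default_rng(1)
X = rng.normal(size=(50000, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=50000)
grad = lambda w, x, y: (w @ x - y) * x
loss = lambda w, x, y: 0.5 * (w @ x - y) ** 2
eta = pick_constant_gain(grad, loss, np.zeros(5), X, Y, [1e-3, 1e-2, 1e-1])
print("selected gain:", eta)
```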

62 Insights on Stopping Criteria Time to reach accuracy $\rho$: 2SGD needs $\frac{\nu}{\rho} + o\!\left(\frac{1}{\rho}\right)$ iterations, SGD needs $\frac{k\nu}{\rho} + o\!\left(\frac{1}{\rho}\right)$ iterations. The number of epochs needed to reach the same test cost as the full optimization is 1 for 2SGD and $k$ for SGD, with $1 \leq k \leq \kappa^2$. There are many ways to make the constant $k$ smaller: exact second order stochastic gradient descent; approximate second order stochastic gradient descent; simple preconditioning tricks.

63 Insights on Stopping Criteria Early stopping with cross validation Create a validation set by setting some training examples apart. Monitor cost function on the validation set. Stop when it stops decreasing. Early stopping a priori Extract two disjoint subsamples of training data. Train on the first subsample; stop by validating on the second. The number of epochs is an estimate of k. Train by performing that number of epochs on the full set. This is asymptotically correct and gives reasonable results in practice.
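A sketch of the a-priori early-stopping recipe on a synthetic least-squares problem; the SGD epoch routine, the constant gain and the stalling test are illustrative assumptions.

```python
import numpy as np

def sgd_epoch(w, X, Y, eta=0.05):
    """One SGD epoch on a least-squares problem (hypothetical setup)."""
    for x, y in zip(X, Y):
        w = w - eta * (w @ x - y) * x
    return w

def cost(w, X, Y):
    return 0.5 * np.mean((X @ w - Y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))
Y = X @ np.ones(5) + 0.5 * rng.normal(size=4000)
A = (X[:1000], Y[:1000])             # first subsample: train
B = (X[1000:2000], Y[1000:2000])     # second, disjoint subsample: validate

# Train on A, validate on B, and count epochs until the validation cost stalls.
w, best, k = np.zeros(5), np.inf, 0
for epoch in range(1, 51):
    w = sgd_epoch(w, *A)
    c = cost(w, *B)
    if c < best - 1e-6:
        best, k = c, epoch
    else:
        break

# Then perform that number of epochs on the full training set.
w_full = np.zeros(5)
for _ in range(k):
    w_full = sgd_epoch(w_full, X, Y)
print(k, cost(w_full, X, Y))
```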
