Large Scale Machine Learning and Stochastic Algorithms
|
|
- Steven Lee
- 6 years ago
- Views:
Transcription
1 Large Scale Machine Learning and Stochastic Algorithms Léon Bottou NEC Laboratories America Princeton
2 Machine Learning in the 1980s A Path to Artificial Intelligence? Emulate cognitive capabilities of humans. Profund, but not immediately obvious, relation with the foundations of statistics, Solution does not involve a human expert, unlike the practice of statistics. Humans learn from abundant and diverse data: vision, sound, speech, text,.... More data than we can handle.
3 Machine Learning in the 2000s A Path to Artificial Intelligence? Still my ultimate goal... Opinions vary... Automated Data Mining Gain competitive advantages by analyzing the masses of data that describes the life of our computerized society. - Placement of web advertisement. - Customer relation management. - Fraud detection. Brute force approach. More data than we can handle.
4 Linear Time Learning Algorithms? The computing resources available for learning do not grow faster than the volume of data. The cost of data mining cannot exceed the revenues. Intelligent animals learn from streaming data. Most machine learning or optimization algorithms demand resources that grow faster than the volume of data. Matrix operations (n 3 time for n 2 coefficients). Sparse matrix operations are worse.
5 Summary I. Machine Learning Redux. II. Machine Learning and Optimization. III. Stochastic Algorithms.
6 ML Redux: The Experimental Paradigm Variations: k-fold cross-validation, etc. This is the main driver for progress in machine learning.
7 ML Redux: Mathematical Statement (i) Assumption Examples are drawn independently from an unknown probability distribution P(x, y) that represents the laws of Nature. Loss Function Function l(ŷ, y) measures the cost of answering ŷ when the true answer is y. Expected Risk We seek to find the function f that minimizes: min E(f) = l( f(x), y ) dp(x, y) f Note: The test set error is an approximation of the expected risk.
8 ML Redux: Mathematical Statement (ii) Approximation Not feasible to search f among all functions. Instead, we search ff that minimizes the Expected Risk E(f) within some richly parametrized family of functions F. Estimation Not feasible to minimize the expectation E(f) because P(x, y) is unknown. Instead, we search f n that minimizes the Empirical Risk E n (f), that is, the average loss over the training set examples. n min f F E n(f) = 1 n i=1 l( f(x i ), y i ) In other words, we optimize a surrogate problem!
9 ML Redux: The Tradeoff E(f n ) E(f ) = ( E(f F ) E(f ) ) Approximation Error + ( E(f n ) E(f F )) Estimation Error Estimation error Approximation error Size of F (Vapnik and Chervonenkis, Ordered risk minimization, 1974). (Vapnik and Chervonenkis, Theorie der Zeichenerkennung, 1979)
10 The Computational Problem (i) Statistical Perspective: It is good to optimize an objective function than ensures a fast estimation rate when the number of examples increases. Optimization Perspective: To efficiently solve large problems, it is preferable to choose an optimization algorithm with strong asymptotic properties, e.g. superlinear. Incorrect Conclusion: To address large-scale learning problems, use a superlinear algorithm to optimize an objective function with fast estimation rate.
11 The Computational Problem (ii) Baseline large-scale learning algorithm Randomly discarding data is the simplest way to handle large datasets. What are the statistical benefits of processing more data? What is the computational cost of processing more data? We need a theory that joins Statistics and Computation! 1967: Vapnik and Chervonenkis theory does not discuss computation. 1981: Valiant s learnability excludes exponential time algorithms, but (i) polynomial time already too slow, (ii) few actual results. We propose a new analysis of approximate optimization...
12 Test Error Test Error versus Learning Time Bayes Limit Computing Time
13 Test Error Test Error versus Learning Time 10,000 examples 100,000 examples 1,000,000 examples Bayes limit Computing Time Vary the number of examples...
14 Test Error Test Error versus Learning Time optimizer a optimizer b optimizer c model I model II model III model IV 10,000 examples 100,000 examples 1,000,000 examples Bayes limit Computing Time Vary the number of examples, the statistical models, the algorithms,...
15 Test Error Test Error versus Learning Time optimizer a optimizer b optimizer c model I model II model III model IV Good Learning Algorithms 10,000 examples 100,000 examples 1,000,000 examples Bayes limit Computing Time Not all combinations are equal.
16 Test Error versus Learning Time $ Good Learning Algorithms Point where we should start working on something else? Bayes limit $ Changing the units along the axes...
17 Learning with Approximate Optimization Computing f n = arg min f F E n (f) is often costly. Since we already optimize a surrogate function why should we compute its optimum f n exactly? Let s assume our optimizer returns f n such that E n ( f n ) < E n (f n ) + ρ. For instance, one could stop an iterative optimization algorithm long before its convergence.
18 Decomposition of the Error (i) E( f n ) E(f ) = E(fF ) E(f ) Approximation error + E(f n ) E(fF ) Estimation error + E( f n ) E(f n ) Optimization error Problem: Choose F, n, and ρ to make this as small as possible, subject to budget constraints { max number of examples n max computing time T
19 Decomposition of the Error (ii) Approximation error bound: decreases when F gets larger. (Approximation theory) Estimation error bound: decreases when n gets larger. increases when F gets larger. (Vapnik-Chervonenkis theory) Optimization error bound: increases with ρ. (Vapnik-Chervonenkis theory plus tricks) Computing time T: decreases with ρ increases with n increases with F (Algorithm dependent)
20 Small-scale vs. Large-scale Learning We can give rigorous definitions. Definition 1: We have a small-scale learning problem when the active budget constraint is the number of examples n. Definition 2: We have a large-scale learning problem when the active budget constraint is the computing time T.
21 Small-scale Learning The active budget constraint is the number of examples. To reduce the estimation error, take n as large as the budget allows. To reduce the optimization error to zero, take ρ = 0. We need to adjust the size of F. Estimation error Approximation error Size of F See Structural Risk Minimization (Vapnik 74) and later works.
22 Large-scale Learning The active budget constraint is the computing time. More complicated tradeoffs. The computing time depends on the three variables: F, n, and ρ. Example. If we choose ρ small, we decrease the optimization error. But we must also decrease F and/or n with adverse effects on the estimation and approximation errors. The exact tradeoff depends on the optimization algorithm. We can compare optimization algorithms rigorously.
23 Executive Summary log (ρ) Best ρ Good optimization algorithm (superlinear). ρ decreases faster than exp( T) Mediocre optimization algorithm (linear). ρ decreases like exp( T) Extraordinary poor optimization algorithm ρ decreases like 1/T log(t)
24 Asymptotics E( f n ) E(f ) = E(fF ) E(f ) Approximation error + E(f n ) E(fF ) Estimation error + E( f n ) E(f n ) Optimization error Asymoptotic Approach All three errors must decrease with comparable rates. Forcing one of the errors to be decrease much faster - costs in computing time, - but does not significantly improve the test error.
25 Asymptotics: Estimation Uniform convergence bounds ([ d Estimation error O n log n ] α ) d with 1 2 α 1. Value d describes the capacity of our system. The simplest capacity measure is the Vapnik-Chervonenkis dimension of F. There are in fact three (four?) types of bounds to consider: ( ) d Classical V-C bounds (pessimistic): O n ( d Relative V-C bounds in the realizable case: O n log n ) ([ d d Localized bounds (variance, Tsybakov): O n log n ] α ) d Fast estimation rates: (Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2005;... )
26 Asymptotics: Estimation+Optimization Uniform convergence arguments give ([ d Estimation error + Optimization error O n log n ] α + ρ) d = ε. This is true for all three cases of uniform convergence bounds. Scaling laws for ρ when F is fixed The approximation error is constant. ] α ) No need to choose ρ smaller than O([ dn log n d. ] α ) Not advisable to choose ρ larger than O([ dn log n d.
27 ... Approximation+Estimation+Optimization When F is chosen via a λ-regularized cost new Uniform convergence theory provides bounds for simple cases (Massart-2000; Zhang 2005; Steinwart et al., ;... ) Scaling laws for n, λ and ρ depend on the optimization algorithm. See (Shalev-Shwartz and Srebro, ICML 2008) for Linear SVMs. When F is realistically complicated Large datasets matter because one can use more features, because one can use richer models. Bounds for such cases are rarely realistic enough.
28 Analysis of a Simple Case Simple parametric setup F is fixed. Functions f w (x) linearly parametrized by w R d. Comparing four iterative optimization algorithms for E n (f) 1. Gradient descent. 2. Second order gradient descent (Newton). 3. Stochastic gradient descent. 4. Stochastic second order gradient descent.
29 Quantities of Interest Empirical Hessian at the empirical optimum w n. H = 2 E n w 2 (f w n ) = 1 n n i=1 2 l(f n (x i ), y i ) w 2 Empirical Fisher Information matrix at the empirical optimum w n. [ n ( l(fn ) ( ) ] (x i ), y i ) l(fn (x i ), y i ) G = 1 n i=1 w w Condition number We assume that there are λ min, λ max and ν such that trace ( GH 1) ν. spectrum ( H ) [λ min, λ max ]. and we define the condition number κ = λ max /λ min.
30 Gradient Descent (GD) Iterate w t+1 w t η E n(f wt ) w Gradient J Best speed achieved with fixed learning rate η = 1/λ max. (e.g., Dennis & Schnabel, 1983) Cost per Iterations Time to reach Time to reach iteration to reach ρ accuracy ρ E( f n ) E(fF ( ) ( ) ) ) < ε GD O(nd) O κ log ρ 1 O ndκ log ρ O( 1 d 2 κ ε 1/α log2 ε 1 In the last column, n and ρ are chosen to reach ε as fast as possible. Solve for ε to find the best error rate achievable in a given time. Remark: abuses of the O() notation
31 Second Order Gradient Descent (2GD) Iterate w t+1 w t H 1 E n(f wt ) w Gradient J We assume H 1 is known in advance. Superlinear optimization speed (e.g., Dennis & Schnabel, 1983) Cost per Iterations Time to reach Time to reach iteration to reach ρ accuracy ρ E( f n ) E(fF ) < ε 2GD O ( d ( d + n )) ( ) ( O log log ρ 1 O d ( d + n ) ) ( ) log log ρ 1 O d 2 ε 1/α log ε 1 log log 1 ε Optimization speed is much faster. Learning speed only saves the condition number κ.
32 Stochastic Gradient Descent (SGD) Iterate Draw random example (x t, y t ). w t+1 w t η t l(f wt (x t ), y t ) w Total Gradient <J(x,y,w)> Partial Gradient J(x,y,w) Best decreasing gain schedule with η = 1/λ min. (see Bottou, 1991; Murata, 1998; Bottou & LeCun, 2004) Cost per Iterations Time to reach Time to reach iteration to reach ρ accuracy ρ E( f n ) E(fF ( ) ) ) ) < ε 1ρ SGD O(d) ν k ρ + o With 1 k κ 2 O( d ν k ρ O( d ν k ε Optimization speed is catastrophic. Learning speed does not depend on the statistical estimation rate α. Learning speed depends on condition number κ but scales very well.
33 Second order Stochastic Descent (2SGD) Iterate Draw random example (x t, y t ). w t+1 w t 1 t H 1 l(f w t (x t ), y t ) w Total Gradient <J(x,y,w)> Partial Gradient J(x,y,w) Replace scalar gain η t by matrix 1 t H 1. Cost per Iterations Time to reach Time to reach iteration to reach ρ accuracy ρ E( f n ) E(fF ) < ε 2SGD O ( d 2) ( ) ) ) ν ρ + o 1ρ O( d 2 ν ρ O( d 2 ν ε Each iteration is d times more expensive. The number of iterations is reduced by κ 2 (or less.) Second order only changes the constants.
34 Benchmarking SGD in Simple Problems The theory suggests that SGD is very competitive. Many people associate SGD with trouble. SGD historically associated with back-propagation. Multilayer networks are very hard problems (nonlinear, nonconvex) What is difficult, SGD or MLP? Try PLAIN SGD on a simple learning problem. Download from These simple programs are very short.
35 Text Categorization with SVMs Dataset Reuters RCV1 document corpus. 781,265 training examples, 23,149 testing examples. 47,152 TF-IDF features. Task Recognizing documents of category CCAT. Minimize 1 n ( ) λ n 2 w2 + l( w x i + b, y i ). i=1 Update w w η t (w t, x t, y t ) = w η t ( λw + l(w x t + b, y t ) w Same setup as (Shalev-Schwartz et al., 2007) but plain SGD. )
36 Text Categorization with SVMs Results: Linear SVM l(ŷ, y) = max{0, 1 yŷ} λ = Training Time Primal cost Test Error SVMLight 23,642 secs % SVMPerf 66 secs % SGD 1.4 secs % Results: Log-Loss Classifier l(ŷ, y) = log(1 + exp( yŷ)) λ = Training Time Primal cost Test Error TRON(LibLinear, ε = 0.01) 30 secs % TRON(LibLinear, ε = 0.001) 44 secs % SGD 2.3 secs %
37 The Wall 0.3 Testing cost 0.2 Training time (secs) 100 SGD 50 TRON (LibLinear) e 05 1e 06 1e 07 1e 08 1e 09 Optimization accuracy (trainingcost optimaltrainingcost)
38 Text Chunking with CRFs Dataset CONLL 2000 Chunking Task: Segment sentences in syntactically correlated chunks (e.g., noun phrases, verb phrases.) 106,978 training segments in 8936 sentences. 23,852 testing segments in 2012 sentences. Model Conditional Random Field (all linear, log-loss.) Features are n-grams of words and part-of-speech tags. 1,679,700 parameters. Same setup as (Vishwanathan et al., 2006) but plain SGD.
39 Text Chunking with CRFs Results Training Time Primal cost Test F1 score L-BFGS 4335 secs % SGD 568 secs % Notes Computing the gradients with the chain rule runs faster than computing them with the forward-backward algorithm. Graph Transformer Networks are nonlinear conditional random fields trained with stochastic gradient descent (Bottou et al., 1997).
40 More SVM Experiments From: Patrick Haffner Date: Wednesday :28:50... I have tried on some of our main datasets... I can send you the example, it is so striking! Patrick Dataset Train Number of % non-0 LIBSVM LLAMA LLAMA SGDSVM size features features (SDot) SVM MAXENT Reuters 781K 47K 0.1% 210, Translation 1000K 274K % days 47,700 1,105 7 SuperTag 950K 46K % 31, Voicetone 579K 88K 0.019% 39,
41 More SVM Experiments From: Olivier Chapelle Date: Sunday :26:44... you should really run batch with various training set sizes... Average Test Loss n=10000 n= n= n=30000 n= stochastic Time (seconds) Log-loss problem Batch Conjugate Gradient on various training set sizes Stochastic Gradient on the full set Why is SGD near the enveloppe?
42 Effect of one Additional Example (i) Compare w n w n+1 = arg min w = arg min w E n (f w ) E n+1 (f w ) = arg min w [E n (f w ) + 1n l( f w (x n+1 ), y n+1 ) ] n+1 n En+1 (f ) w E n(f w) w* n+1 w* n
43 Effect of one Additional Example (ii) First Order Calculation wn+1 = w n 1 l ( ) ( ) f wn (x n ), y n 1 n H 1 n+1 + O w n 2 where H n+1 is the empirical Hessian on n + 1 examples. Compare with Second Order Stochastic Gradient Descent w t+1 = w t 1 t H 1 l( f wt (x n ), y n ) w Could they converge with the same speed? C 2 assumptions = Accurate speed estimates.
44 Speed of Scaled Stochastic Gradient Study w t+1 = w t 1 t B t ) l (f wt (x n ),y n w + O ( ) 1 t with Bt B 0, BH I/2. 2 Establish convergence a.s. via quasi-martingales (see Bottou, 1991, 1998). Let U t = H (w t w ) (w t w ). Observe E(f wt ) E(f w ) = tr(u t ) + o(tr(u t )) Derive E t (U t+1 ) = [ I 2BH t + o ( )] 1 t Ut + HBGB t + o ( 1 2 Lemma: study real sequence u t+1 = ( 1 + α t + o( 1 When α > 1 show u t = β 1 α 1 t + o( 1 t) (nasty proof!). When α < 1 show u t t α (up to log factors). ) t where G is the Fisher matrix. 2 )) t ut + β t + o ( ) 1 2 t. 2 Bracket E(tr(U t+1 )) between two such sequences and conclude: ( ) tr(hbgb) 1 1 2λ max BH 1 t +o t Interesting special cases: B = I/λ min H and B = H 1. E [ E(f wt ) E(f w ) ] tr(hbgb) ( ) 1 1 2λ min BH 1 t +o t After (Bottou & LeCun, 2003), see also (Fabian, 1973; Murata & Amari, 1998).
45 Chasing the constants pays. Follow the leader Second-order SGD lim n n E[ E(f w n ) E(f F ) ] = lim t E[ E(f wt ) E(f F ) ] = tr(g H 1 ) t lim n n E[ w wn 2] = lim t t E[ w w t 2] = tr(h 1 G H 1 ) Best solution in F. One Pass of Second Order Stochastic Gradient w n K/n w =w* w 0 = w* 0 Empirical Optima w* n Best training set error.
46 Optimal Learning in One Pass Given a large enough training set, a Single Pass of Second Order Stochastic Gradient generalizes as well as the Empirical Optimum. Experiments on synthetic data Mse* +1e 1 Mse* +1e 2 Mse* +1e 3 Mse* +1e Number of examples Milliseconds
47 What matters is the Constant Test set performance decreases in K/t. Constant K depends on: Second order versus first order optimization. How quickly we reach the asymptotic constant. Number of operations per iteration. Parallel programming. Programming language. Programmer s skill. All on an equal footing!
48 Unfortunate Theoretical Issues The dangers of asymptotic analysis! Asymptotic one pass learning whenever B t H 1. E [ E(f wt ) E(f w ) ] = tr(g H 1 ) t ( ) 1 + o t What happens when B t H 1 slowly? One-pass learning speed regime may not be reached in one pass...
49 Unfortunate Practical Issues Second Order SGD is not that fast! w t+1 w t 1 t B t l(f wt (x t ), y t ) w Must estimate and store d d matrix H 1. Must multiply the gradient for each example by the matrix H 1. Sparsity tricks no longer work because H 1 is not sparse.
50 Directions in Stochastic Algorithm Design Limited storage approximations of H 1 Reduce the number of epochs Rarely sufficient for fast one-pass learning. Diagonal approximation (Becker &LeCun, 1989, Bordes & Bottou, 2007) Low rank approximation (e.g., LeCun et al., 1998, LeRoux et al., 2007) Online L-BFGS approximation (Schraudolph, 2007) Exploiting Duality Coordinate ascent in the dual is related to SGD. Dual Hessian sometimes more amenable to approximation. Some successes (Bordes and Bottou, ). Long term perspective uncertain. Averaged SGD Good asymptotic constants (Polyak and Juditsky 1992) Asymptotic regime long to settle... possibly fixable (Wei Xu, to appear)
51 Engineering Algorithms - Sparsity The very simple SGD update offers lots of engineering opportunities. Example: Sparse Linear SVM The update w w η ( λw + l(wx i, y i ) ) can be performed in two steps: i) w w η l(wx i, y i ) (sparse, cheap) ii) w w (1 ηλ) (not sparse, costly) Solution Perform only step (i) for each training example. Perform step (ii) with lower frequency and higher gain. General Idea Scheduling SGD operations according to their cost!
52 Engineering Algorithms SGD-QN (i) SGD-QN - A Stochastic Gradient SVM. Official winner of the Pascal Large Scale Learning Challenge Diagonal Hessian Approximation Gain on the condition numbers with lowest computational cost. Schedule of SGD operations according to their costs - regularization versus example updates - reestimation of the diagonal hessian. Careful engineering makes it run faster (although we did not get any points for that.) Design Principle Start with simple SGD. Incremental improvements of the constants. (Bordes, Bottou, Gallinary, 2008)
53 Engineering Algorithms SGD-QN (ii) Require: λ, w 0, t 0, T 1: t = 0 SVMSGD0 2: while t T do 3: w t+1 = w t 1 λ(t+t 0 ) (λw t + l (y t w tx t )y t x t ) 4: 5: 6: 7: 8: 9: t = t : end while 11: return w T SVMSGD2 Require: λ, w 0, t 0, T, skip 1: t = 0, count =skip 2: while t T do 3: w t+1 = w t 1 λ(t+t 0 ) l (y t w tx t )y t x t 4: count = count 1 5: if count < 0 then 6: w t+1 = w t+1 skip t+t 0 w t+1 7: count = skip 8: end if 9: t = t : end while 11: return w T
54 SVMSGD2 Require: λ, w 0, t 0, T, skip 1: t = 0, count =skip 2: 3: while t T do 4: w t+1 = w t 1 λ(t+t 0 ) l (y t w tx t )y t x t 5: 6: 7: 8: 9: 10: 11: count = count 1 12: if count < 0 then 13: w t+1 = w t+1 skip (t + t 0 ) 1 w t+1 14: count = skip 15: end if 16: t = t : end while 18: return w T SGD-QN Require: λ, w 0, t 0, T, skip 1: t = 0, count =skip, 2: B = λ 1 I, updateb =false, r = 0 3: while t T do 4: w t+1 = w t (t + t 0 ) 1 l (y t w tx t )y t B x t 5: if updateb = true then 6: p t = g t (w t+1 ) g( t (w t ) ) 7: i, B ii = B ii + 2 r [wt+1 w t ] i [p t ] 1 i B ii 8: i, B ii = max(b ii, 10 2 λ 1 ) 9: r = r + 1, updateb =false 10: end if 11: count = count 1 12: if count < 0 then 13: w t+1 = w t+1 skip (t + t 0 ) 1 λ B w t+1 14: count = skip, updateb = true 15: end if 16: t = t : end while 18: return w T
55 Engineering Algorithms SGD-QN (iii) SVMSGD2 SGD-QN olbfgs LibLinear SVMSGD2 SGD-QN olbfgs LibLinear Number of epochs Training time (sec.) Pascal Large Scale Challenge Delta dataset Test errors (in %) according to the number of epochs (left) and training duration (right).
56 SGD in Real Life: A Check Reader. Examples are pairs (image,amount). Field segmentation Character segmentation Character recognition Syntactical interpretation. Define differentiable modules. Pretrain modules with hand-labelled data. Assemble them to form a composite model. Define global cost function. Train with SGD for a few weeks. (Bottou, LeCun, Bengio, 1997; LeCun, Bottou, et al., 1998) Industrially deployed; ran billions of checks since 1996.
57 SGD in Real Life: Sentence Analysis Binary encoded sentence words. Words embedded in dim space Five Time Delay Multilayer networks : Part Of Speech Tagging ERR: 2.76% ( treebank, split / 23 ) State of the art: 2.75% Named Entity Recognition ( treebank, Stanford NER ) Chunking ( treebank ) F1: 88.97% State of the art: 89.31% F1: 92.7% ( 94.9% w/pos ) State of the art: 94.4%. Semantic Role Labeling ( propbank ) WER: ~14% State of the art: ~13% Positional information relative to the chosen predicate for semantic tagging Language Model ( wikipedia, 620M examples ) Trash (Collobert and Weston, 2007, 2008) [not my work] No hand-tuned parsing tricks. No hand-tuned linguistic features. State-of-the-art accuracies. Analyzes a sentence in 50 milliseconds (instead of seconds.) Trains with SGD in about 3 weeks.
58 Conclusions Qualitatively different tradeoffs for small and large scale learning problems. Traditional performance measurements for optimization algorithms do not apply very well to machine learning problems Stochastic algorithms provide excellent performance, both in theory and in practice. There is room for improvements. Engineering matters!
59 The End
60 Insights on the Gain Schedule (i) Decreasing gains: w t+1 w t η t + t 0 (w t, x t, y t ) Asymptotic Theory if s = 2η λ min < 1 then slow rate O ( t s) if s = 2η λ min > 1 then faster rate O ( s 2 s 1 t 1) Example: the SVM benchmark Use η = 1/λ because λ λ min. Choose t 0 to make sure that the expected initial updates are comparable with the expected size of the weights.
61 Insights on the Gain Schedule (ii) The sample size n does not change the SGD maths! Constant gain: w t+1 w t η (w t, x t, y t ) At any moment during training, we can: Pick a random subset of examples with moderate size. Try various gains η on the subsample. Pick the gain η that most reduces the cost. Use it for the next iterations on the full dataset.
62 Insights on Stopping Criteria Time to reach accuracy ρ 2SGD ( ) ν 1 ρ + o ρ SGD ( ) k ν 1 ρ + o ρ Number of epochs to reach same test cost as the full optimization. 1 k 1 k κ 2 There are many ways to make constant k smaller: Exact second order stochastic gradient descent. Approximate second order stochastic gradient descent. Simple preconditionning tricks.
63 Insights on Stopping Criteria Early stopping with cross validation Create a validation set by setting some training examples apart. Monitor cost function on the validation set. Stop when it stops decreasing. Early stopping a priori Extract two disjoint subsamples of training data. Train on the first subsample; stop by validating on the second. The number of epochs is an estimate of k. Train by performing that number of epochs on the full set. This is asymptotically correct and gives reasonable results in practice.
Large Scale Machine Learning with Stochastic Gradient Descent
Large Scale Machine Learning with Stochastic Gradient Descent Léon Bottou leon@bottou.org Microsoft (since June) Summary i. Learning with Stochastic Gradient Descent. ii. The Tradeoffs of Large Scale Learning.
More informationNumerical Optimization Techniques
Numerical Optimization Techniques Léon Bottou NEC Labs America COS 424 3/2/2010 Today s Agenda Goals Representation Capacity Control Operational Considerations Computational Considerations Classification,
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationAlgorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Language Modeling III Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Office hours on website but no OH for Taylor until next week. Efficient Hashing Closed address
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationNonlinear Optimization Methods for Machine Learning
Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks
More informationLarge Scale Semi-supervised Linear SVM with Stochastic Gradient Descent
Journal of Computational Information Systems 9: 15 (2013) 6251 6258 Available at http://www.jofcis.com Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent Xin ZHOU, Conghui ZHU, Sheng
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationQuasi-Newton Methods. Javier Peña Convex Optimization /36-725
Quasi-Newton Methods Javier Peña Convex Optimization 10-725/36-725 Last time: primal-dual interior-point methods Consider the problem min x subject to f(x) Ax = b h(x) 0 Assume f, h 1,..., h m are convex
More informationMore Optimization. Optimization Methods. Methods
More More Optimization Optimization Methods Methods Yann YannLeCun LeCun Courant CourantInstitute Institute http://yann.lecun.com http://yann.lecun.com (almost) (almost) everything everything you've you've
More informationClassification and Pattern Recognition
Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationLecture 5 Neural models for NLP
CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm
More informationStochastic Quasi-Newton Methods
Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationCS6375: Machine Learning Gautam Kunapuli. Support Vector Machines
Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More informationProbabilistic Models for Sequence Labeling
Probabilistic Models for Sequence Labeling Besnik Fetahu June 9, 2011 Besnik Fetahu () Probabilistic Models for Sequence Labeling June 9, 2011 1 / 26 Background & Motivation Problem introduction Generative
More informationCoordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
Journal of Machine Learning Research 9 (2008) 1369-1398 Submitted 1/08; Revised 4/08; Published 7/08 Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines Kai-Wei Chang Cho-Jui
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationSequence Labelling SVMs Trained in One Pass
Sequence Labelling SVMs Trained in One Pass Antoine Bordes 12, Nicolas Usunier 1, and Léon Bottou 2 1 LIP6, Université Paris 6, 104 Avenue du Pdt Kennedy, 75016 Paris, France 2 NEC Laboratories America,
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationSimple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017
Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient
More informationStable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems
Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch
More informationPart of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015
Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)
More informationDan Roth 461C, 3401 Walnut
CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers
More informationPart-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the
More informationTopics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families
Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationCTJLSVM: Componentwise Triple Jump Acceleration for Training Linear SVM
CTJLSVM: Componentwise Triple Jump Acceleration for Training Linear SVM Han-Shen Huang, Porter Chang (Ker2) and Chun-Nan Hsu AI for Investigating Anti-cancer solutions (AIIA Lab) Institute of Information
More informationStatistical Machine Learning from Data
January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationCOMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16
COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-
More informationIPAM Summer School Optimization methods for machine learning. Jorge Nocedal
IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep
More informationStochastic Optimization Methods for Machine Learning. Jorge Nocedal
Stochastic Optimization Methods for Machine Learning Jorge Nocedal Northwestern University SIAM CSE, March 2017 1 Collaborators Richard Byrd R. Bollagragada N. Keskar University of Colorado Northwestern
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationSGD and Deep Learning
SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients
More informationSemantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing
Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding
More informationClassification with Perceptrons. Reading:
Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationConditional Random Field
Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions
More informationImproving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationNon-Convex Optimization. CS6787 Lecture 7 Fall 2017
Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper
More informationSpeaker Representation and Verification Part II. by Vasileios Vasilakakis
Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationHigh Order Methods for Empirical Risk Minimization
High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu IPAM Workshop of Emerging Wireless
More informationA Parallel SGD method with Strong Convergence
A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationLarge-scale machine learning Stochastic gradient descent
Stochastic gradient descent IRKM Lab April 22, 2010 Introduction Very large amounts of data being generated quicker than we know what do with it ( 08 stats): NYSE generates 1 terabyte/day of new trade
More informationStatistical NLP for the Web
Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks
More informationKernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning
Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:
More informationAccelerating SVRG via second-order information
Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationStatistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields
Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project
More informationCPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017
CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code
More informationMachine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang
Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set
More informationLog-Linear Models, MEMMs, and CRFs
Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx
More informationProbabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationQuasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization
Quasi-Newton Methods Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization 10-725 Last time: primal-dual interior-point methods Given the problem min x f(x) subject to h(x)
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationNeural Networks Learning the network: Backprop , Fall 2018 Lecture 4
Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:
More informationSupport Vector Machines
Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationHigh Order Methods for Empirical Risk Minimization
High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu Thanks to: Aryan Mokhtari, Mark
More informationMachine Learning Lecture 6 Note
Machine Learning Lecture 6 Note Compiled by Abhi Ashutosh, Daniel Chen, and Yijun Xiao February 16, 2016 1 Pegasos Algorithm The Pegasos Algorithm looks very similar to the Perceptron Algorithm. In fact,
More informationStatistical Ranking Problem
Statistical Ranking Problem Tong Zhang Statistics Department, Rutgers University Ranking Problems Rank a set of items and display to users in corresponding order. Two issues: performance on top and dealing
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationDeep Learning Sequence to Sequence models: Attention Models. 17 March 2018
Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:
More informationCSCI567 Machine Learning (Fall 2018)
CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due
More informationLecture 13: Structured Prediction
Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More information