Large Scale Machine Learning with Stochastic Gradient Descent
Slide 1: Large Scale Machine Learning with Stochastic Gradient Descent. Léon Bottou, Microsoft (since June).
Slide 2: Summary.
i. Learning with Stochastic Gradient Descent.
ii. The Tradeoffs of Large Scale Learning.
iii. Asymptotic Analysis.
iv. Learning with a Single Pass.
Slide 3: I. Learning with Stochastic Gradient Descent.
Slide 4: Example: binary classification.
Patterns $x$; classes $y = \pm 1$.
Linear model. Choose features $\Phi(x) \in \mathbb{R}^d$; linear discriminant function $f_w(x) = \operatorname{sign}(w^\top \Phi(x))$.
Slide 5: SVM training.
Choose a loss function, e.g. the log loss $Q(x, y, w) = \ell(y, f_w(x)) = \log\big(1 + e^{-y\, w^\top \Phi(x)}\big)$.
We cannot minimize the expected risk $E(w) = \int Q(x, y, w)\, dP(x, y)$.
We can compute the empirical risk $E_n(w) = \frac{1}{n} \sum_{i=1}^n Q(x_i, y_i, w)$.
Minimize the $L_2$-regularized empirical risk:
$$\min_w \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^n Q(x_i, y_i, w)$$
Choosing $\lambda$ is the same as setting a constraint $\|w\|^2 < B$.
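To make the objective concrete, here is a minimal NumPy sketch of the regularized empirical risk with the log loss. The function names and the dense matrix layout are illustrative choices, not from the talk.

```python
import numpy as np

def log_loss(w, X, Y):
    """Per-example log loss Q(x, y, w) = log(1 + exp(-y w.Phi(x))).
    X is the (n, d) matrix of feature vectors Phi(x_i); Y holds labels in {-1, +1}."""
    margins = Y * (X @ w)
    return np.logaddexp(0.0, -margins)  # numerically stable log(1 + e^{-m})

def regularized_empirical_risk(w, X, Y, lam):
    """lambda/2 * ||w||^2 + (1/n) * sum_i Q(x_i, y_i, w)."""
    return 0.5 * lam * (w @ w) + log_loss(w, X, Y).mean()
```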
Slide 6: Batch versus Online.
Batch: process all examples together (GD). Example: minimization by gradient descent. Repeat:
$$w \leftarrow w - \gamma \left( \lambda w + \frac{1}{n} \sum_{i=1}^n \frac{\partial Q}{\partial w}(x_i, y_i, w) \right)$$
Online: process examples one by one (SGD). Example: minimization by stochastic gradient descent. Repeat:
(a) Pick a random example $(x_t, y_t)$.
(b) $w \leftarrow w - \gamma_t \left( \lambda w + \frac{\partial Q}{\partial w}(x_t, y_t, w) \right)$
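A sketch contrasting the two updates, assuming the regularized log loss from the previous slide; the gain schedule and helper names are mine, not the talk's.

```python
import numpy as np

def grad_Q(w, x, y):
    """Gradient of the log loss Q(x, y, w) with respect to w."""
    return -y * x / (1.0 + np.exp(y * (x @ w)))

def gd_step(w, X, Y, lam, gamma):
    """One batch GD step: one full sweep over all n examples per update."""
    g = np.mean([grad_Q(w, x, y) for x, y in zip(X, Y)], axis=0)
    return w - gamma * (lam * w + g)

def sgd_pass(w, X, Y, lam, gamma0):
    """One SGD pass: n updates, each from a single randomly picked example."""
    n = len(Y)
    for t in range(1, n + 1):
        i = np.random.randint(n)
        gamma_t = gamma0 / (1.0 + gamma0 * lam * t)  # decreasing gain
        w = w - gamma_t * (lam * w + grad_Q(w, X[i], Y[i]))
    return w
```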
Slide 7: Second order optimization.
Batch (2GD). Example: Newton's algorithm. Repeat:
$$w \leftarrow w - H^{-1} \left( \lambda w + \frac{1}{n} \sum_{i=1}^n \frac{\partial Q}{\partial w}(x_i, y_i, w) \right)$$
Online (2SGD). Example: second order stochastic gradient descent. Repeat:
(a) Pick a random example $(x_t, y_t)$.
(b) $w \leftarrow w - \gamma_t H^{-1} \left( \lambda w + \frac{\partial Q}{\partial w}(x_t, y_t, w) \right)$
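A minimal sketch of the 2SGD update, assuming some approximation H_inv of the inverse Hessian is already available (slide 32 returns to why obtaining one is the hard part); the names are mine.

```python
import numpy as np

def sgd2_step(w, x, y, lam, H_inv, gamma_t):
    """One second order SGD step: rescale the stochastic gradient by H^{-1}."""
    g = lam * w - y * x / (1.0 + np.exp(y * (x @ w)))  # gradient of regularized log loss
    return w - gamma_t * (H_inv @ g)
```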
Slide 8: More SGD Algorithms.
Adaline (Widrow and Hoff, 1960). Loss: $Q_{\text{adaline}} = \frac{1}{2}\big(y - w^\top \Phi(x)\big)^2$, with $\Phi(x) \in \mathbb{R}^d$, $y = \pm 1$. Update: $w \leftarrow w + \gamma_t \big(y_t - w^\top \Phi(x_t)\big)\, \Phi(x_t)$.
Perceptron (Rosenblatt, 1957). Loss: $Q_{\text{perceptron}} = \max\{0, -y\, w^\top \Phi(x)\}$, with $\Phi(x) \in \mathbb{R}^d$, $y = \pm 1$. Update: $w \leftarrow w + \gamma_t\, y_t\, \Phi(x_t)$ if $y_t\, w^\top \Phi(x_t) \le 0$, and $w$ unchanged otherwise.
Multilayer perceptrons (Rumelhart et al., 1986) ...
SVM (Cortes and Vapnik, 1995) ...
Lasso (Tibshirani, 1996). Loss: $Q_{\text{lasso}} = \lambda \|w\|_1 + \frac{1}{2}\big(y - w^\top \Phi(x)\big)^2$, with $w = (u_1 - v_1, \dots, u_d - v_d)$, $\Phi(x) \in \mathbb{R}^d$, $y \in \mathbb{R}$, $\lambda > 0$. Updates: $u_i \leftarrow \big[u_i - \gamma_t \big(\lambda - (y_t - w^\top \Phi(x_t))\, \Phi_i(x_t)\big)\big]_+$ and $v_i \leftarrow \big[v_i - \gamma_t \big(\lambda + (y_t - w^\top \Phi(x_t))\, \Phi_i(x_t)\big)\big]_+$, with notation $[x]_+ = \max\{0, x\}$.
K-Means (MacQueen, 1967). Loss: $Q_{\text{kmeans}} = \min_k \frac{1}{2}(z - w_k)^2$, with $z \in \mathbb{R}^d$, centroids $w_1 \dots w_k \in \mathbb{R}^d$, and counts $n_1 \dots n_k \in \mathbb{N}$, initially 0. Update: $k^* = \arg\min_k (z_t - w_k)^2$, then $n_{k^*} \leftarrow n_{k^*} + 1$ and $w_{k^*} \leftarrow w_{k^*} + \frac{1}{n_{k^*}}(z_t - w_{k^*})$.
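Two of these updates translated to NumPy, as a hedged sketch rather than reference code; variable names are mine.

```python
import numpy as np

def perceptron_step(w, phi_x, y, gamma_t):
    """Perceptron update: move only when the example is misclassified."""
    if y * (w @ phi_x) <= 0:
        w = w + gamma_t * y * phi_x
    return w

def kmeans_step(centers, counts, z):
    """Online k-means update: pull the nearest center toward z with gain 1/n_k."""
    k = np.argmin(((centers - z) ** 2).sum(axis=1))  # closest center
    counts[k] += 1
    centers[k] += (z - centers[k]) / counts[k]
    return centers, counts
```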
Slide 9: II. The Tradeoffs of Large Scale Learning.
Slide 10: The Computational Problem.
Baseline large-scale learning algorithm: randomly discarding data is the simplest way to handle large datasets.
What is the statistical benefit of processing more data?
What is the computational cost of processing more data?
We need a theory that links Statistics and Computation!
1967: Vapnik's theory does not discuss computation.
1984: Valiant's learnability excludes exponential time algorithms, but (i) polynomial time is already too slow, (ii) few actual results.
Slide 11: Decomposition of the Error.
$$E(\tilde f_n) - E(f^*) = \underbrace{E(f^*_{\mathcal{F}}) - E(f^*)}_{\text{approximation error } \mathcal{E}_{\mathrm{app}}} + \underbrace{E(f^*_n) - E(f^*_{\mathcal{F}})}_{\text{estimation error } \mathcal{E}_{\mathrm{est}}} + \underbrace{E(\tilde f_n) - E(f^*_n)}_{\text{optimization error } \mathcal{E}_{\mathrm{opt}}}$$
Here $f^*$ is the best possible function, $f^*_{\mathcal{F}}$ the best function in the family $\mathcal{F}$, $f^*_n$ the empirical risk minimizer, and $\tilde f_n$ the approximate solution returned by the optimizer with tolerance $\rho$.
Problem: choose $\mathcal{F}$, $n$, and $\rho$ to make this as small as possible, subject to budget constraints: a maximum number of examples $n$ and a maximum computing time $T$.
Note: choosing $\lambda$ is the same as choosing $\mathcal{F}$.
Slide 12: Small-scale Learning.
The active budget constraint is the number of examples.
To reduce the estimation error, take $n$ as large as the budget allows.
To reduce the optimization error to zero, take $\rho = 0$.
We need to adjust the size of $\mathcal{F}$.
[Figure: estimation error grows and approximation error shrinks as the size of $\mathcal{F}$ increases.]
See Structural Risk Minimization (Vapnik, 1974) and later works.
Slide 13: Large-scale Learning.
The active budget constraint is the computing time. More complicated tradeoffs: the computing time depends on the three variables $\mathcal{F}$, $n$, and $\rho$.
Example: if we choose $\rho$ small, we decrease the optimization error, but we must also decrease $\mathcal{F}$ and/or $n$, with adverse effects on the estimation and approximation errors.
The exact tradeoff depends on the optimization algorithm. We can compare optimization algorithms rigorously.
Slide 14: Test Error versus Learning Time.
[Figure: test error as a function of computing time, decreasing toward the Bayes limit.]
Slide 15: Test Error versus Learning Time.
[Figure: test error versus computing time for 10,000, 100,000, and 1,000,000 examples, above the Bayes limit.]
Vary the number of examples...
Slide 16: Test Error versus Learning Time.
[Figure: the same plot with curves for optimizers a, b, c; models I, II, III, IV; and 10,000, 100,000, 1,000,000 examples.]
Vary the number of examples, the statistical models, the algorithms, ...
Slide 17: Test Error versus Learning Time.
[Figure: the red envelope of the best optimizer/model/sample-size combinations marks the good learning algorithms.]
Not all combinations are equal. Let's compare the red curve for different optimization algorithms.
Slide 18: III. Asymptotic Analysis.
Slide 19: Asymptotic Analysis.
$$E(\tilde f_n) - E(f^*) = \mathcal{E} = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}}$$
All three errors must decrease at comparable rates. Forcing one of the errors to decrease much faster would require additional computing effort, but would not significantly improve the test error.
Slide 20: Statistics.
Asymptotics of the statistical components of the error. Thanks to refined uniform convergence arguments:
$$\mathcal{E} = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}} \sim \mathcal{E}_{\mathrm{app}} + \left( \frac{\log n}{n} \right)^{\alpha} + \rho, \quad \text{with exponent } \tfrac{1}{2} \le \alpha \le 1.$$
Asymptotically effective large scale learning: must choose $\mathcal{F}$, $n$, and $\rho$ such that
$$\mathcal{E} \sim \mathcal{E}_{\mathrm{app}} \sim \mathcal{E}_{\mathrm{est}} \sim \mathcal{E}_{\mathrm{opt}} \sim \left( \frac{\log n}{n} \right)^{\alpha} \sim \rho.$$
What about optimization times?
Slide 21: Statistics and Computation.

                              GD                       2GD                                     SGD     2SGD
Time per iteration:           n                        n                                       1       1
Iterations to accuracy ρ:     log(1/ρ)                 log log(1/ρ)                            1/ρ     1/ρ
Time to accuracy ρ:           n log(1/ρ)               n log log(1/ρ)                          1/ρ     1/ρ
Time to error E:              (1/E^(1/α)) log²(1/E)    (1/E^(1/α)) log(1/E) log log(1/E)       1/E     1/E

- 2GD optimizes much faster than GD.
- SGD optimization speed is catastrophic.
- SGD learns faster than both GD and 2GD.
- 2SGD only changes the constants.
Slide 22: Experiment: Text Categorization.
Dataset: Reuters RCV1 document corpus; 781,265 training examples, 23,149 testing examples.
Task: recognizing documents of category CCAT; 47,152 TF-IDF features; linear SVM.
Same setup as (Joachims, 2006) and (Shalev-Shwartz et al., 2007), using plain SGD.
Slide 23: Experiment: Text Categorization.
Results, hinge loss SVM: $Q(x, y, w) = \max\{0, 1 - y\, w^\top \Phi(x)\}$, $\lambda =$

                                 Training time    Primal cost    Test error
SVMLight                         23,642 secs                     %
SVMPerf                          66 secs                         %
SGD                              1.4 secs                        %

Results, log loss SVM: $Q(x, y, w) = \log\big(1 + \exp(-y\, w^\top \Phi(x))\big)$, $\lambda =$

                                 Training time    Primal cost    Test error
TRON (LibLinear, ε = 0.01)       30 secs                         %
TRON (LibLinear, ε = 0.001)      44 secs                         %
SGD                              2.3 secs                        %
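For reference, a plain-SGD trainer for the hinge loss objective used in this comparison could look like the following sketch; the gain schedule and stopping rule are assumptions, not the exact experimental code.

```python
import numpy as np

def sgd_hinge(X, Y, lam, gamma0, epochs=1):
    """Plain SGD for min_w lam/2 ||w||^2 + (1/n) sum_i max(0, 1 - y_i w.x_i)."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            gamma_t = gamma0 / (1.0 + gamma0 * lam * t)
            g = lam * w
            if Y[i] * (X[i] @ w) < 1:   # hinge active: margin violated
                g = g - Y[i] * X[i]     # subgradient of the loss term
            w = w - gamma_t * g
    return w
```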
Slide 24: The Wall.
[Figure: testing cost and training time in seconds versus optimization accuracy (training cost minus optimal training cost, from 1e-5 down to 1e-9), comparing SGD and TRON (LibLinear).]
Slide 25: IV. Learning with a Single Pass.
Slide 26: Batch and online paths.
[Figure: in parameter space, the online path makes one pass over the examples {z1 ... zt}, moving from w1 toward the true solution w* (best generalization), while the batch path runs many iterations on the same examples {z1 ... zt} and converges to w*_t, the best training set error.]
Slide 27: Effect of one Additional Example (i).
Compare
$$w^*_n = \arg\min_w E_n(f_w)$$
$$w^*_{n+1} = \arg\min_w E_{n+1}(f_w) = \arg\min_w \left[ E_n(f_w) + \frac{1}{n}\, \ell\big(f_w(x_{n+1}), y_{n+1}\big) \right]$$
[Figure: contours of $E_n(f_w)$ and $E_{n+1}(f_w)$ with the nearby minimizers $w^*_n$ and $w^*_{n+1}$.]
Slide 28: Effect of one Additional Example (ii).
First order calculation:
$$w^*_{n+1} = w^*_n - \frac{1}{n}\, H_{n+1}^{-1}\, \frac{\partial \ell}{\partial w}\big(f_{w^*_n}(x_{n+1}), y_{n+1}\big) + O\!\left(\frac{1}{n^2}\right)$$
where $H_{n+1}$ is the empirical Hessian on $n+1$ examples.
Compare with second order stochastic gradient descent:
$$w_{t+1} = w_t - \frac{1}{t}\, H^{-1}\, \frac{\partial \ell}{\partial w}\big(f_{w_t}(x_t), y_t\big)$$
Could they converge with the same speed? $C^2$ assumptions $\Rightarrow$ accurate speed estimates.
Slide 29: Speed of Scaled Stochastic Gradient.
Study
$$w_{t+1} = w_t - \frac{1}{t}\, B_t\, \frac{\partial \ell}{\partial w}\big(f_{w_t}(x_t), y_t\big) + O\!\left(\frac{1}{t^2}\right), \quad \text{with } B_t \to B \succ 0,\; BH \succ I/2.$$
Establish almost-sure convergence via quasi-martingales (see Bottou, 1991, 1998).
Let $U_t = H\, (w_t - w^*)(w_t - w^*)^\top$. Observe $E(f_{w_t}) - E(f_{w^*}) = \operatorname{tr}(U_t) + o(\operatorname{tr}(U_t))$.
Derive
$$\mathbb{E}_t(U_{t+1}) = \left[ I - \frac{2BH}{t} + o\!\left(\frac{1}{t}\right) \right] U_t + \frac{HBGB}{t^2} + o\!\left(\frac{1}{t^2}\right)$$
where $G$ is the Fisher matrix.
Lemma: study the real sequence $u_{t+1} = \left(1 - \frac{\alpha}{t} + o\!\left(\frac{1}{t}\right)\right) u_t + \frac{\beta}{t^2} + o\!\left(\frac{1}{t^2}\right)$.
When $\alpha > 1$, show $u_t = \frac{\beta}{\alpha - 1}\, \frac{1}{t} + o\!\left(\frac{1}{t}\right)$ (nasty proof!). When $\alpha < 1$, show $u_t \sim t^{-\alpha}$ (up to log factors).
Bracket $\mathbb{E}(\operatorname{tr}(U_{t+1}))$ between two such sequences and conclude:
$$\frac{\operatorname{tr}(HBGB)}{2\lambda_{\max}(BH) - 1}\, \frac{1}{t} + o\!\left(\frac{1}{t}\right) \;\le\; \mathbb{E}\big[E(f_{w_t}) - E(f_{w^*})\big] \;\le\; \frac{\operatorname{tr}(HBGB)}{2\lambda_{\min}(BH) - 1}\, \frac{1}{t} + o\!\left(\frac{1}{t}\right)$$
Interesting special cases: $B = I/\lambda_{\min}(H)$ and $B = H^{-1}$.
Slide 30: Asymptotic Efficiency of Second Order SGD.
$$\lim_{n \to \infty} n\, \mathbb{E}\big[E(f^*_n) - E(f^*_{\mathcal{F}})\big] = \lim_{t \to \infty} t\, \mathbb{E}\big[E(f_{w_t}) - E(f^*_{\mathcal{F}})\big]$$
$$\lim_{n \to \infty} n\, \mathbb{E}\big[\|w^* - w^*_n\|^2\big] = \lim_{t \to \infty} t\, \mathbb{E}\big[\|w^* - w_t\|^2\big]$$
where $w^*$ is the best solution in $\mathcal{F}$, $w^*_n$ are the empirical optima (best training set error), and $w_t$ are the second order SGD iterates.
[Figure: one pass of second order stochastic gradient, starting from $w_0 = w^*_0$, keeps the iterates $w_n$ within $K/n$ of the empirical optima $w^*_n$, and both converge to $w^*$.]
(Fabian, 1973; Murata & Amari, 1998; Bottou & LeCun, 2003).
Slide 31: Optimal Learning in One Pass.
A single pass of second order stochastic gradient generalizes as well as the empirical optimum.
Experiments on synthetic data.
[Figure: level curves at Mse* + 1e-1, 1e-2, 1e-3, ... plotted against the number of examples and against milliseconds.]
Slide 32: Unfortunate Issues.
Unfortunate theoretical issue: how long to reach the asymptotic regime? The one-pass learning speed regime may not be reached in one pass...
Unfortunate practical issue: second order SGD is rarely feasible. One must estimate and store the $d \times d$ matrix $H^{-1}$, and multiply the gradient for each example by this matrix $H^{-1}$.
Slide 33: Solutions.
Limited storage approximations of $H^{-1}$:
- Diagonal Gauss-Newton (Becker and LeCun, 1989)
- Low rank approximation [oLBFGS] (Schraudolph et al., 2007)
- Diagonal approximation [SGDQN] (Bordes et al., 2009)
Averaged stochastic gradient (ASGD):
- Perform SGD with slowly decreasing gains, e.g. $\gamma_t \sim t^{-3/4}$.
- Compute the averages $\bar w_{t+1} = \frac{t}{t+1}\, \bar w_t + \frac{1}{t+1}\, w_{t+1}$.
- Same asymptotic speed as 2SGD (Polyak and Juditsky, 1992).
- Can take a while to reach the asymptotic regime.
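A sketch of ASGD on the earlier hinge loss objective, assuming the $t^{-3/4}$ gain decay named above; initialization and schedule details are illustrative.

```python
import numpy as np

def asgd_hinge(X, Y, lam, gamma0, epochs=1):
    """Averaged SGD: run SGD with slowly decreasing gains and
    return the running average of the iterates (Polyak-Ruppert)."""
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            gamma_t = gamma0 * (1.0 + gamma0 * lam * t) ** (-0.75)  # slower than 1/t
            g = lam * w
            if Y[i] * (X[i] @ w) < 1:
                g = g - Y[i] * X[i]
            w = w - gamma_t * g
            w_bar += (w - w_bar) / t    # running mean of w_1 .. w_t
    return w_bar
```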
Slide 34: Experiment: ALPHA dataset.
From the 2008 Pascal Large Scale Learning Challenge.
Loss: $Q(x, y, w) = \big(\max\{0, 1 - y\, w^\top x\}\big)^2$ (squared hinge).
SGD, SGDQN: $\gamma_t = \gamma_0 (1 + \gamma_0 \lambda t)^{-1}$. ASGD: $\gamma_t = \gamma_0 (1 + \gamma_0 \lambda t)^{-3/4}$.
[Figure: expected risk and test error (%) versus the number of epochs for SGD, SGDQN, and ASGD.]
ASGD nearly reaches the optimal expected risk after a single pass.
Slide 35: Experiment: Conditional Random Field.
CRF for the CoNLL 2000 chunking task; 1.7M parameters; 107,000 training segments.
[Figure: test loss and test FB1 score versus epochs for SGD, SGDQN, and ASGD.]
SGDQN is more attractive than ASGD. Training times: 500 secs (SGD), 150 secs (ASGD), 75 secs (SGDQN). The standard L-BFGS optimizer needs 72 minutes.
Slide 36: V. Conclusions.
Slide 37: Conclusions.
- Good optimization algorithm ≠ good learning algorithm.
- SGD is a poor optimization algorithm.
- SGD is a good learning algorithm for large scale problems.
- SGD variants can learn in a single pass (given enough data).