Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017
|
|
- Lizbeth Higgins
- 5 years ago
- Views:
Transcription
1 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017
2 Step Sizes and Convergence
3 Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient descent Because we don t have to process the entire training set But converges to a noise ball lim T!1 E kx T x k 2 apple M 2µ µ 2
4 Controlling the accuracy Want the noise ball to be as small as possible for accurate solutions Noise ball proportional to the step size/learning rate lim T!1 E kx T x k 2 apple M 2µ µ 2 = O( ) So should we make the step size as small as possible?
5 Effect of step size on convergence E Let s go back to the convergence rate proof for SGD From the previous lecture, we have E hkx t+1 x k 2 x t i If we re far from the noise ball i.e. hkx t+1 x k 2 x t i apple (1 µ) 2 kx t x k M. kx t x k 2 2 M µ apple (1 µ) 2 kx t x k 2 + µ 2 kx t x k 2.
6 Effect of step size on convergence (continued) E hkx t+1 x k 2 x t i apple (1 µ) 2 kx t x k 2 + µ 2 kx t x k 2 µ apple 1 kx t x k 2 (if µ <1) 2 µ apple exp kx t x k 2. 2 So to contract by a factor of C, we need to run T steps, where 1 = exp µt 2 C, T = 2 µ log C
7 The Full Effect of Step Size Noise ball proportional to the step size lim E kx T x k 2 apple M T!1 2µ µ 2 = O( ) Convergence time inversely proportional to the step size So there s a trade-off! T = 2 µ log C
8 Demo
9 Can we get the best of both worlds? When do we want the step size to be large? At the beginning of of execution? execution! Near the end? Both? When do we want the step size to be small? At the beginning of execution? Near the end? end! Both? What about using a decreasing step size scheme?
10 SGD with Varying Step Size Allow the step size to vary over time x t+1 = x t t rf(x t ; y it ) Turns out this is the standard in basically all machine learning! Two ways to do it: Chosen a priori step sizes step size doesn t depend on measurements Adaptive step sizes choose step size based on measurements & heuristics
11 Optimal Step Sizes for Convex Objectives E Can we use math to choose a step size? Start with our previous bound h kx t+1 x k 2i h apple (1 t µ) 2 E kx t x k 2i + t 2 M h apple (1 t µ)e kx t x k 2i + t 2 M (for t µ<1) Right side is minimized when 0= µe h kx t x k 2i +2 t M, t = µ 2M E h kx t x k 2i
12 Optimal Step Sizes (continued) Let t = E hkx t x k 2i t+1 apple (1 t µ) t + t 2 M µ = 1 2M t µ t + µ 2 = t 2M 2 t + µ2 4M 2 t µ 2 = t 4M 2 t µ 2 2M t M
13 Optimal Step Sizes (continued) Let t = E h kx t x k 2i 1 t+1 (since (1 z) 1 1+z) t µ 2 4M 2 t µ 2 1 = 1 1 t 4M t 1 1+ µ2 4M t t 1 = 1 t + µ2 4M.
14 Optimal Step Sizes (continued) Let t = E hkx t x k 2i µ2 T T 0 4M ) T apple 4M 0 4M + µ 2 0 T Sometimes called a 1/T rate. Slower than the linear rate of gradient descent.
15 Optimal Step Sizes (continued) Substitute back in to find how to set the step size: t = µ 2M 4M 0 4M + µ 2 0 t = 2µ 0 4M + µ 2 0 t = 1 (t) This is a pretty common simple scheme General form is t = 0 1+ t
16 Demo
17 Have we solved step sizes for SGD forever? No. We don t usually know what µ, L, M, and 0 Even if the problem is convex are This optimal rate optimizes the upper bound on the expected distance-squared to the optimum But sometimes this bound is loose, and other step size schemes might do better
18 What if we don t know the parameters? One idea: still use a step size scheme of the form Choose parameters t = 0 1+ t via some other method 0 and t For example, hand-tuning which doesn t scale Can we do this automatically? Yes! This is an example of metaparameter optimization
19 Other Techniques Decrease the step size in epochs Still asymptotically t = 1 (t) but step size decreases in discrete steps Useful for parallelization and saves a little compute Per-parameter learning rates e.g. AdaGrad Replace scalar with diagonal matrix A. Helps with poorly scaled problems x t+1 = x t Arf(x t ; y it )
20 Mini-Batching
21 Gradient Descent vs. SGD Gradient descent: all examples at once x t+1 = x t t 1 N Stochastic gradient descent: one example at a time Is it really all or nothing? Can we do something intermediate? NX i=1 rf(x t ; y i ) x t+1 = x t t rf(x t ; y it )
22 Mini-Batch Stochastic Gradient Descent An intermediate approach x t+1 = x t t 1 B t X where B t is sampled uniformly from the set of all subsets of {1,, N} of size b. The b parameter is the batch size Typically choose b << N. Also called mini-batch gradient descent i2b t rf(x t ; y i )
23 Advantages of Mini-Batch Takes less time to compute each update than gradient descent Only needs to sum up b gradients, rather than N x t+1 = x t t 1 B t X i2b t rf(x t ; y i ) But takes more time for each update than SGD So what s the benefit? It s more like gradient descent, so maybe it converges faster than SGD?
24 Mini-Batch SGD Converges Start by breaking up the update rule into expected update and noise x t+1 x = x t x t (rh(x t ) rh(x )) 1 X t (rf(x t ; y i ) rh(x t )) B t i2b t Variance analysis Var (x t+1 x )=Var t 1 B t X i2b t (rf(x t ; y i ) rh(x t ))!
25 Mini-Batch SGD Converges (continued) Let i = rf(x t ; y i ) rh(x t ), and i = Var (x t+1 x )= 2 t B t 2 Var X = 2 t B t 2 Var = 2 t B t 2 NX i=1 NX j=1 (rf(x t ; y i ) rh(x t )) i2b t! NX i=1 i i i j i j ( 1 i 2 B t 0 i/2 B t!
26 Mini-Batch SGD Converges (continued) Because we sampled B uniformly at random, for i j E [ i j] =P (i 2 B ^ j 2 B) =P (i 2 B) P (j 2 B i 2 B) = b N E 2 b i = P (i 2 B) = N So we can write the variance as 0 Var (x t+1 x )= 2 t B t X i6=j b(b 1) N(N 1) i j + NX i=1 b N 2 i b 1 N 1 1 A
27 Mini-Batch SGD Converges (continued) Var (x t+1 x )= 2 t b 2 = 2 t bn = 2 t bn X i6=j b 1 N 1 b 1 N 1 b(b 1) N(N 1) NX i=1 NX j=1 NX i=1 i i j + i j + NX i=1 NX i=1 b N! 2 + N b N i NX i=1 1 A b 1 N i A 2 i 1 A = 2 t (N b) bn(n 1) NX i=1 2 i
28 Mini-Batch SGD Converges (continued) Var (x t+1 x )= 2 t (N b) b(n 1) 1 N apple 2 t (N b) b(n 1) M apple 2 M t b Compared with SGD, variance decreased by a factor of b NX i=1 = 2 t (N b) h b(n 1) E krf(x t ; y it ) 2 i rh(x t )k 2 x t i
29 Mini-Batch SGD Converges (continued) Recall that SGD converged to a noise ball of size lim E kx T x k 2 apple M T!1 2µ µ 2 Since mini-batching decreases variance by a factor of b, it will have h lim E kx T T!1 Noise ball smaller by the same factor! x k 2i apple M (2µ µ 2 )b
30 Advantages of Mini-Batch (reprise) Takes less time to compute each update than gradient descent Only needs to sum up b gradients, rather than N x t+1 = x t t 1 B t X i2b t rf(x t ; y i ) Converges to a smaller noise ball than stochastic gradient descent h lim E kx T T!1 x k 2i apple M (2µ µ 2 )b
31 How to choose the batch size? Mini-batching is not a free win Naively, compared with SGD, it takes b times as much effort to get a b-times-asaccurate answer But we could have gotten a b-times-as-accurate answer by just running SGD for b times as many steps with a step size of /b. But it still makes sense to run it for systems and statistical reasons Mini-batching exposes more parallelism Mini-batching lets us estimate statistics about the full gradient more accurately Another use case for metaparameter optimization
32 Mini-Batch SGD is very widely used Including in basically all neural network training b = 32 is a typical default value for batch size From Practical Recommendations for Gradient-Based Training of Deep Architectures, Bengio 2012.
33 Overfitting, Generalization Error, and Regularization
34 Minimizing Training Loss is Not our Real Goal Training loss looks like h(x) = 1 N What we actually want to minimize is expected loss on new examples Drawn from some real-world distribution ɸ NX i=1 f(x; y i ) h(x) =E y [f(x; y)] Typically, assume the training examples were drawn from this distribution
35 Overfitting Minimizing the training loss doesn't generally minimize the expected loss on new examples They are two different objective functions after all Difference between the empirical loss on the training set and the expected loss on new examples is called the generalization error Even a model that has high accuracy on the training set can have terrible performance on new examples Phenomenon is called overfitting
36 Demo
37 How to address overfitting Many, many techniques to deal with overfitting Have varying computational costs But this is a systems course so what can we do with little or no extra computational cost? Notice from the demo that some loss functions do better than others Can we modify our loss function to prevent overfitting?
38 Regularization Add an extra regularization term to the objective function Most popular type: L2 regularization h(x) = 1 N NX f(x; y i )+ 2 kxk 2 2 = 1 N i=1 Also popular: L1 regularization h(x) = 1 NX f(x; y i )+ kxk N 1 = 1 N i=1 NX i=1 NX i=1 f(x; y i )+ 2 dx k=1 x 2 i dx f(x; y i )+ x i k=1
39 Benefits of Regularization Cheap to compute For SGD and L2 regularization, there s just an extra scaling x t+1 =(1 2 t 2 )x t t rf(x t ; y it ) Makes the objective strongly convex This makes it easier to get and prove bounds on convergence Helps with overfitting
40 Motivation One way to think about regularization is as a Bayesian prior MLE interpretation of learning problem: P (y i x) = 1 Z exp( f(x; y i)) P (x y) = P (y x) P (x) P (y) Taking the logarithm: log P (x y) = log(z) = P (y x) P (x) P (y) NX i=1 NY i=1 1 Z exp( f(x; y i ) + log P (x) f(x; y i)) log P (y)
41 Motivation (continued) So the MLE problem becomes max x log P (x y) = N X i=1 Now we need to pick a prior probability distribution for x Say we choose a Gaussian prior max x log P (x y) = N X = i=1 f(x; y i ) + log NX f(x; y i ) i=1 f(x; y i ) + log P (x) + (constants) 2 1 p 2 exp!! 2 kxk 2 +(C) 2 2 kxk2 +(C) There s our regularization term!
42 Motivation (continued) By setting a prior, we limit the values we think the model can have This prevents the model from doing bad things to try to fit the data Like the polynomial-fitting example from the demo
43 Demo
44 How to choose the regularization parameter? One way is to use an independent validation set to estimate the test error, and set the regularization parameter manually so that it is high enough to avoid overfitting This is what we saw in the demo But doing this naively can be computationally expensive Need to re-run learning algorithm many times Yet another use case for metaparameter optimization
45 Early Stopping
46 Asymptotically large training sets Setting 1: we have a distribution ɸ and we sample a very large (asymptotically infinite) number of points from it, then run stochastic gradient descent on that training set for only N iterations. Can our algorithm in this setting overfit? No, because its training set is asymptotically equal to the true distribution. Can we compute this efficiently? No, because its training set is asymptotically infinitely large
47 Consider a second setting Setting 1: we have a distribution ɸ and we sample a very large (asymptotically infinite) number of points from it, then run stochastic gradient descent on that training set for only N iterations. Setting 2: we have a distribution ɸ and we sample N points from it, then run stochastic gradient descent using each of these points exactly once. What is the difference between the output of SGD in these two settings? Asymptotically, there s no difference! So SGD in Setting 2 will also never overfit
48 Early Stopping Motivation: if we only use each training example once for SGD, then we can t overfit. So if we only use each example a few times, we probably won t overfit too much. Early stopping: just stop running SGD before it converges.
49 Benefits of Early Stopping Cheap to compute Literally just does less work It seems like the technique was designed to make systems run faster Helps with overfitting
50 How Early to Stop You ll see this in detail in next Wednesday s paper. Yet another application of metaparameter tuning.
51 Questions? Upcoming things Labor day next Monday no lecture Paper Presentation #1 on Wednesday read paper before class
Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017
Stochastic Gradient Descent: The Workhorse of Machine Learning CS6787 Lecture 1 Fall 2017 Fundamentals of Machine Learning? Machine Learning in Practice this course What s missing in the basic stuff? Efficiency!
More informationNon-Convex Optimization. CS6787 Lecture 7 Fall 2017
Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationComments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms
Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationAlgorithms other than SGD. CS6787 Lecture 10 Fall 2017
Algorithms other than SGD CS6787 Lecture 10 Fall 2017 Machine learning is not just SGD Once a model is trained, we need to use it to classify new examples This inference task is not computed with SGD There
More informationCPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017
CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationCSC321 Lecture 7: Optimization
CSC321 Lecture 7: Optimization Roger Grosse Roger Grosse CSC321 Lecture 7: Optimization 1 / 25 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationMachine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari
Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationCSC321 Lecture 8: Optimization
CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationCSC321 Lecture 2: Linear Regression
CSC32 Lecture 2: Linear Regression Roger Grosse Roger Grosse CSC32 Lecture 2: Linear Regression / 26 Overview First learning algorithm of the course: linear regression Task: predict scalar-valued targets,
More informationLeast Mean Squares Regression. Machine Learning Fall 2018
Least Mean Squares Regression Machine Learning Fall 2018 1 Where are we? Least Squares Method for regression Examples The LMS objective Gradient descent Incremental/stochastic gradient descent Exercises
More informationWhy should you care about the solution strategies?
Optimization Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms makes you more effectively choose which algorithm to run Understanding the
More informationLeast Mean Squares Regression
Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method
More informationIntroduction to Bayesian Learning. Machine Learning Fall 2018
Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability
More informationCOMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation
COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 2, 2015 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationLecture 2: Linear regression
Lecture 2: Linear regression Roger Grosse 1 Introduction Let s ump right in and look at our first machine learning algorithm, linear regression. In regression, we are interested in predicting a scalar-valued
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationDeep Learning & Neural Networks Lecture 4
Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationIntroduction to Machine Learning HW6
CS 189 Spring 2018 Introduction to Machine Learning HW6 Your self-grade URL is http://eecs189.org/self_grade?question_ids=1_1,1_ 2,2_1,2_2,3_1,3_2,3_3,4_1,4_2,4_3,4_4,4_5,4_6,5_1,5_2,6. This homework is
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationStochastic Gradient Descent. Ryan Tibshirani Convex Optimization
Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationJ. Sadeghi E. Patelli M. de Angelis
J. Sadeghi E. Patelli Institute for Risk and, Department of Engineering, University of Liverpool, United Kingdom 8th International Workshop on Reliable Computing, Computing with Confidence University of
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationLecture 9: Generalization
Lecture 9: Generalization Roger Grosse 1 Introduction When we train a machine learning model, we don t just want it to learn to model the training data. We want it to generalize to data it hasn t seen
More informationWeek 3: Linear Regression
Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to
More informationMachine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)
Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationNonlinear Optimization Methods for Machine Learning
Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks
More informationLecture 6 Optimization for Deep Neural Networks
Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things
More informationSVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning
SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationModeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop
Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector
More informationNeural Networks: Backpropagation
Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationWays to make neural networks generalize better
Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationMachine Learning (CSE 446): Backpropagation
Machine Learning (CSE 446): Backpropagation Noah Smith c 2017 University of Washington nasmith@cs.washington.edu November 8, 2017 1 / 32 Neuron-Inspired Classifiers correct output y n L n loss hidden units
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationMachine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationPresentation in Convex Optimization
Dec 22, 2014 Introduction Sample size selection in optimization methods for machine learning Introduction Sample size selection in optimization methods for machine learning Main results: presents a methodology
More informationDeep Learning & Artificial Intelligence WS 2018/2019
Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target
More informationValue Function Methods. CS : Deep Reinforcement Learning Sergey Levine
Value Function Methods CS 294-112: Deep Reinforcement Learning Sergey Levine Class Notes 1. Homework 2 is due in one week 2. Remember to start forming final project groups and writing your proposal! Proposal
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationOptimization and Gradient Descent
Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationNeural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35
Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How
More informationDan Roth 461C, 3401 Walnut
CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn
More informationCSE446: Clustering and EM Spring 2017
CSE446: Clustering and EM Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, Dan Klein, and Luke Zettlemoyer Clustering systems: Unsupervised learning Clustering Detect patterns in unlabeled
More informationCSE 591: Introduction to Deep Learning in Visual Computing. - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan
CSE 591: Introduction to Deep Learning in Visual Computing - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan Overview Background Why another network structure? Vanishing and exploding
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationIntroduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module 2 Lecture 05 Linear Regression Good morning, welcome
More informationCOMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017
COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS
More informationParameter estimation Conditional risk
Parameter estimation Conditional risk Formalizing the problem Specify random variables we care about e.g., Commute Time e.g., Heights of buildings in a city We might then pick a particular distribution
More informationSupport Vector Machines
Support Vector Machines INFO-4604, Applied Machine Learning University of Colorado Boulder September 28, 2017 Prof. Michael Paul Today Two important concepts: Margins Kernels Large Margin Classification
More informationCPSC 340: Machine Learning and Data Mining. Gradient Descent Fall 2016
CPSC 340: Machine Learning and Data Mining Gradient Descent Fall 2016 Admin Assignment 1: Marks up this weekend on UBC Connect. Assignment 2: 3 late days to hand it in Monday. Assignment 3: Due Wednesday
More informationWeek 5: Logistic Regression & Neural Networks
Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationData Mining & Machine Learning
Data Mining & Machine Learning CS57300 Purdue University March 1, 2018 1 Recap of Last Class (Model Search) Forward and Backward passes 2 Feedforward Neural Networks Neural Networks: Architectures 2-layer
More informationTutorial on Machine Learning for Advanced Electronics
Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationCS60010: Deep Learning
CS60010: Deep Learning Sudeshna Sarkar Spring 2018 16 Jan 2018 FFN Goal: Approximate some unknown ideal function f : X! Y Ideal classifier: y = f*(x) with x and category y Feedforward Network: Define parametric
More informationIntroduction to Neural Networks
Introduction to Neural Networks Philipp Koehn 3 October 207 Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models
More informationDeep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation
Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationIntroduction to Optimization
Introduction to Optimization Konstantin Tretyakov (kt@ut.ee) MTAT.03.227 Machine Learning So far Machine learning is important and interesting The general concept: Fitting models to data So far Machine
More information12 Statistical Justifications; the Bias-Variance Decomposition
Statistical Justifications; the Bias-Variance Decomposition 65 12 Statistical Justifications; the Bias-Variance Decomposition STATISTICAL JUSTIFICATIONS FOR REGRESSION [So far, I ve talked about regression
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationMachine Learning and Data Mining. Linear regression. Kalev Kask
Machine Learning and Data Mining Linear regression Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ Parameters q Learning algorithm Program ( Learner ) Change q Improve performance
More informationCSC2541 Lecture 5 Natural Gradient
CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient 1 / 12 Motivation Two classes of optimization procedures used throughout ML (stochastic) gradient descent,
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationSummary and discussion of: Dropout Training as Adaptive Regularization
Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationECE521 Lecture 7/8. Logistic Regression
ECE521 Lecture 7/8 Logistic Regression Outline Logistic regression (Continue) A single neuron Learning neural networks Multi-class classification 2 Logistic regression The output of a logistic regression
More informationUnraveling the mysteries of stochastic gradient descent on deep neural networks
Unraveling the mysteries of stochastic gradient descent on deep neural networks Pratik Chaudhari UCLA VISION LAB 1 The question measures disagreement of predictions with ground truth Cat Dog... x = argmin
More information