Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari (srihari@cedar.buffalo.edu)


Topics
1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning

Stochastic Gradient Descent (SGD)
Nearly all of deep learning is powered by SGD. SGD extends the gradient descent algorithm.
Recall gradient descent: suppose y = f(x), where both x and y are real numbers. The derivative is a function denoted f'(x) or dy/dx. It gives the slope of f(x) at the point x, i.e., it specifies how a small change in the input produces a corresponding change in the output:
f(x + ε) ≈ f(x) + ε f'(x)
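As a quick numerical check (not from the slides), the first-order approximation can be verified in Python for a simple function such as f(x) = x², which is an illustrative choice:

# Numerical check of f(x + eps) ~ f(x) + eps * f'(x), using f(x) = x**2
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x, eps = 3.0, 1e-3
exact = f(x + eps)
approx = f(x) + eps * f_prime(x)
print(exact, approx)   # 9.006001 vs 9.006 -- they agree to first order in eps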

How Gradient Descent uses derivatives
The criterion f(x) is minimized by moving from the current solution in the direction of the negative of the gradient.

Gradient with multiple inputs
For multiple inputs we need partial derivatives: ∂f(x)/∂x_i is how f changes as only x_i increases. The gradient of f, denoted ∇_x f(x), is the vector of all partial derivatives.
Gradient descent proposes a new point
x' = x − ε ∇_x f(x)
where ε is the learning rate, a positive scalar set to a small constant.
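A minimal sketch of this update rule in NumPy, assuming the simple quadratic criterion f(x) = ||x||², whose gradient is 2x (the function and starting point are illustrative choices, not from the slides):

import numpy as np

# Gradient descent on f(x) = ||x||^2, whose gradient is 2x
def grad_f(x):
    return 2.0 * x

x = np.array([3.0, -4.0])    # starting point (arbitrary)
eps = 0.1                    # learning rate: a small positive constant
for _ in range(100):
    x = x - eps * grad_f(x)  # move in the direction of the negative gradient
print(x)                     # close to the minimizer [0, 0]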

Learning rate for deep learning
It is useful to reduce ε as training progresses. A constant learning rate is the default in Keras; momentum and decay are set to 0 by default:
keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)   # constant learning rate
Time-based decay uses decay_rate = learning_rate / epochs:
SGD(lr=0.1, momentum=0.8, decay=decay_rate, nesterov=False)
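Putting the slide's two settings side by side, a sketch using the older Keras optimizer API (this assumes a Keras version that still accepts lr and decay; the epochs value is a placeholder, and newer Keras versions use learning_rate and learning-rate schedules instead):

import keras   # assumption: a Keras version whose SGD still accepts lr= and decay=

epochs = 50                            # placeholder value, not from the slides
learning_rate = 0.1
decay_rate = learning_rate / epochs    # time-based decay, as on the slide

constant_sgd = keras.optimizers.SGD(lr=learning_rate, momentum=0.0, decay=0.0, nesterov=False)
decayed_sgd  = keras.optimizers.SGD(lr=learning_rate, momentum=0.8, decay=decay_rate, nesterov=False)

# Either optimizer would then be passed to model.compile(optimizer=..., loss=...)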

Computational bottleneck
A recurring problem in machine learning: large training sets are necessary for good generalization, but large training sets are also computationally expensive. SGD is an extension of gradient descent that offers a solution. Moreover, it is a method of generalizing beyond the training set.

Cost function is a sum over samples
The criterion in machine learning is a cost function. The cost function often decomposes as a sum of per-example loss functions. E.g., the negative conditional log-likelihood of the training data is
J(θ) = E_{x,y~p̂_data} L(x,y,θ) = (1/m) Σ_{i=1}^m L(x^(i), y^(i), θ)
where m is the number of samples and L is the per-example loss, L(x,y,θ) = −log p(y|x;θ).
In linear regression we minimize
J(θ) = (1/2) Σ_{i=1}^m { y^(i) − θ^T x^(i) }²
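A small sketch of this decomposition for the linear-regression case (the function and variable names are mine, not from the slides); here the cost is written as the 1/m average of per-example squared-error losses, matching the expectation form above:

import numpy as np

def per_example_loss(theta, x_i, y_i):
    # squared-error loss L(x, y, theta) for a single example
    return 0.5 * (y_i - theta @ x_i) ** 2

def cost(theta, X, y):
    # J(theta): the cost decomposes as an average of per-example losses
    m = len(y)
    return sum(per_example_loss(theta, X[i], y[i]) for i in range(m)) / m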

Gradient is also a sum over samples
For these additive cost functions, gradient descent requires computing
∇_θ J(θ) = (1/m) Σ_{i=1}^m ∇_θ L(x^(i), y^(i), θ)
In linear regression,
∇_θ ln p(y|X,θ,β) = β Σ_{i=1}^m { y^(i) − θ^T x^(i) } x^(i)T
The computational cost of this operation is O(m). As the training set size grows to billions, the time taken for a single gradient step becomes prohibitively long.
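A sketch of the corresponding full-batch gradient for linear regression (again with placeholder names); each call sums contributions from all m examples, which is the O(m) cost referred to above:

import numpy as np

def full_gradient(theta, X, y):
    # Gradient of the mean squared-error cost: averages m per-example gradients,
    # so every call touches all m rows of X -- O(m) work per gradient step.
    residuals = X @ theta - y
    return X.T @ residuals / len(y)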

Insight of SGD
Insight: the gradient is an expectation. The expectation may be approximated using a small set of samples. In each step of SGD we sample a minibatch of examples B = {x^(1),..,x^(m')} drawn uniformly from the training set. The minibatch size m' is typically chosen to be small: 1 to a hundred. Crucially, m' is held fixed even if the sample set is in the billions. We may fit a training set with billions of examples using updates computed on only a hundred examples.
∇_θ J(θ) = (1/m) Σ_{i=1}^m ∇_θ L(x^(i), y^(i), θ)
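To see this claim concretely, one can compare the full gradient with a minibatch estimate on synthetic linear-regression data (the sizes, seed, and names here are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
m, d, m_batch = 100_000, 5, 100
X = rng.normal(size=(m, d))                            # synthetic inputs x^(i)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)  # synthetic targets y^(i)
theta = np.zeros(d)

full = X.T @ (X @ theta - y) / m                       # uses all m examples: O(m)
idx = rng.choice(m, size=m_batch, replace=False)       # minibatch B drawn uniformly from the training set
mini = X[idx].T @ (X[idx] @ theta - y[idx]) / m_batch  # uses only m' = 100 examples
print(np.linalg.norm(full), np.linalg.norm(full - mini))  # estimation error is small relative to the gradient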

SGD estimate on minibatch
The estimate of the gradient is formed as
g = (1/m') Σ_{i=1}^{m'} ∇_θ L(x^(i), y^(i), θ)
using examples from the minibatch B. SGD then follows the estimated gradient downhill:
θ ← θ − εg
where ε is the learning rate.
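A minimal SGD training loop in the same illustrative setting (it reuses X, y, theta, m, and rng from the sketch above; the learning rate and step count are arbitrary choices):

eps = 0.01           # learning rate
m_batch = 100        # minibatch size m'
for step in range(2_000):
    idx = rng.choice(m, size=m_batch, replace=False)     # sample minibatch B uniformly
    g = X[idx].T @ (X[idx] @ theta - y[idx]) / m_batch   # g = (1/m') * sum of per-example gradients over B
    theta = theta - eps * g                              # theta <- theta - eps * g
# theta ends up close to the least-squares solution, at a per-step cost that does not depend on m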

How good is SGD?
In the past, gradient descent was regarded as slow and unreliable, and its application to non-convex optimization problems was regarded as unprincipled. SGD is not guaranteed to arrive at even a local minimum in reasonable time, but it often finds a very low value of the cost function quickly enough to be useful.

SGD and Training Set Size
Outside of deep learning, SGD is the main way to train large linear models on very large data sets. Without SGD, the cost per update increases with m; the cost per SGD update does not depend on the training set size m (it depends only on m'). As m → ∞, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. The asymptotic cost of training a model with SGD is O(1) as a function of m.

Deep Learning vs SVM
Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model.
SVM: f(x) = w^T x + b = b + Σ_{i=1}^m α_i x^T x^(i)
Replace x by a feature function ϕ(x) and the dot product with a kernel function k(x, x^(i)) = ϕ(x)·ϕ(x^(i)). This requires constructing an m × m matrix G_{i,j} = k(x^(i), x^(j)). Constructing this matrix is O(m²).
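A short NumPy sketch of why this scales badly: building the Gram matrix touches every pair of examples, so its cost grows quadratically in m (the data, kernel choice, and sizes here are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
m, d = 1_000, 10
X = rng.normal(size=(m, d))          # synthetic training inputs

def rbf_kernel(a, b, gamma=0.5):
    # an RBF kernel chosen for illustration
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Gram matrix G[i, j] = k(x^(i), x^(j)): m*m entries, so O(m^2) time and memory
G = np.array([[rbf_kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
print(G.shape)   # (1000, 1000) -- one million kernel evaluations for just 1000 examples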

Growth of interest in Deep Learning
In academia (with medium-sized data sets): starting in 2006, deep learning was interesting because it performed better on data sets with thousands of examples.
In industry (with large data sets): because it provided a scalable way of training nonlinear models on large datasets.