Stochastic Gradient Descent. CS 584: Big Data Analytics


Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration is relatively cheap Can be slow to converge

Example: Linear Regression Optimization problem: $\min_w \|Xw - y\|^2$ Closed-form solution: $w = (X^\top X)^{-1} X^\top y$ Gradient update: $w^+ = w - \frac{\gamma}{m} \sum_i (x_i^\top w - y_i)\, x_i$ Requires an entire pass through the data!
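To make the batch update concrete, here is a minimal NumPy sketch of gradient descent on the least-squares objective above; the step size, iteration count, and synthetic data are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=500):
    """Gradient descent for min_w ||Xw - y||^2, averaged over the m examples."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m   # one full pass through the data per step
        w -= lr * grad
    return w

# Usage on synthetic data: the estimate should approach the closed-form solution
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)
print(batch_gradient_descent(X, y))          # close to w_true
print(np.linalg.solve(X.T @ X, X.T @ y))     # closed-form (X^T X)^{-1} X^T y
```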

Tackling Compute Problems: Scaling to Large n Streaming implementation Parallelize your batch algorithm Aggressively subsample the data Change the algorithm or training method Optimization is a surrogate for learning Trade off weaker optimization against more data

Tradeoffs of Large Scale Learning True (generalization) error is a function of approximation error, estimation error, and optimization error, subject to the number of training examples and the computational time The solution will depend on which budget constraint is active Bottou and Bousquet (2011). The Tradeoffs of Large-Scale Learning. In Optimization for Machine Learning (pp. 351-368).

Minimizing Generalization Error If $n \to \infty$, then $\varepsilon_{\text{est}} \to 0$ For fixed generalization error, as the number of samples increases, we can increase the optimization tolerance Talk by Aditya Menon, UCSD

Expected Risk vs Empirical Risk Minimization Expected Risk: assume we know the ground-truth distribution P(x, y) Expected risk associated with the classification function: $E(f_w) = \int L(f_w(x), y)\, dP(x, y) = \mathbb{E}[L(f_w(x), y)]$ Empirical Risk: in the real world, the ground-truth distribution is not known Only the empirical risk can be calculated for the function: $E_n(f_w) = \frac{1}{n} \sum_i L(f_w(x_i), y_i)$

Gradient Descent Reformulated $w^+ = w - \gamma \frac{1}{n} \sum_i \nabla_w L(f_w(x_i), y_i) = w - \gamma\, \nabla E_n(f_w)$, where $\gamma$ is the learning rate or gain True gradient descent is a batch algorithm, slow but sure Under sufficient regularity assumptions, if the initial estimate is close to the optimum and the gain is sufficiently small, convergence is linear

Stochastic Optimization Motivation Information is redundant amongst samples Sufficient samples means we can afford more frequent, noisy updates Never-ending stream means we should not wait for all data Tracking non-stationary data means that the target is moving

Stochastic Optimization Idea: Estimate the function and gradient from a small, current subsample of your data; with enough iterations and data, you will converge to the true minimum Pro: Better for large datasets and often faster convergence Con: Hard to reach high accuracy Con: The best classical methods can't handle stochastic approximation Con: Theoretical definitions of convergence are not as well-defined

Stochastic Gradient Descent (SGD) Randomized gradient estimate to minimize the function using a single randomly picked example Instead of $\nabla f$, use an estimate $\widetilde{\nabla} f$ where $\mathbb{E}[\widetilde{\nabla} f] = \nabla f$ The resulting update is of the form: $w^+ = w - \gamma\, \nabla_w L(f_w(x_i), y_i)$ Although random noise is introduced, it behaves like gradient descent in expectation

SGD Algorithm
Randomly initialize parameter $w$ and learning rate $\gamma$
while not converged do
    Randomly shuffle the examples in the training set
    for $i = 1, \dots, N$ do
        $w^+ = w - \gamma\, \nabla_w L(f_w(x_i), y_i)$
    end
end
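The loop above translates almost line for line into NumPy. The sketch below is illustrative rather than the course's reference code; the fixed learning rate, a fixed epoch count standing in for the convergence test, and the squared-error loss used for grad_fn are assumptions:

```python
import numpy as np

def sgd(X, y, grad_fn, lr=0.01, n_epochs=20, seed=0):
    """SGD: shuffle the training set, then update on one example at a time."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):                  # stand-in for "while not converged"
        for i in rng.permutation(m):           # randomly shuffle the examples
            w -= lr * grad_fn(w, X[i], y[i])   # w+ = w - gamma * grad_w L(f_w(x_i), y_i)
    return w

# Per-example gradient of the squared-error loss L = 0.5 * (x^T w - y)^2
def squared_error_grad(w, x, y):
    return (x @ w - y) * x
```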

The Benefits of SGD Gradient is easy to calculate ("instantaneous") Less prone to local minima Small memory footprint Get to a reasonable solution quickly Works for non-stationary environments as well as online settings Can be used for more complex models and error surfaces

Importance of Learning Rate The learning rate has a large impact on convergence Too small → too slow Too large → oscillatory and may even diverge Should the learning rate be fixed or adaptive? Is convergence necessary? Non-stationary: convergence may not be required or desired Stationary: the learning rate should decrease with time A Robbins-Monro sequence such as $\gamma_t = 1/t$ is adequate (sketched below)
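As a small illustration of the stationary case, the variant below decays the step size as $\gamma_t = \gamma_0 / t$, a Robbins-Monro sequence; gamma0 and the reuse of a per-example grad_fn as in the previous sketch are assumptions:

```python
import numpy as np

def sgd_decaying_lr(X, y, grad_fn, gamma0=0.5, n_epochs=20, seed=0):
    """SGD with a decreasing Robbins-Monro step size gamma_t = gamma0 / t."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    t = 1
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            w -= (gamma0 / t) * grad_fn(w, X[i], y[i])  # step size shrinks over time
            t += 1
    return w
```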

Mini-batch Stochastic Gradient Descent Rather than using a single point, use a random subset whose size is smaller than the original data size: $w^+ = w - \gamma \frac{1}{|S_k|} \sum_{i \in S_k} \nabla_w L(f_w(x_i), y_i)$, where $S_k \subset [n]$ Like the single random sample, the full gradient is approximated via an unbiased noisy estimate The random subset reduces the variance by a factor of $1/|S_k|$, but is also $|S_k|$ times more expensive
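A sketch of the mini-batch version, again assuming a per-example grad_fn as in the earlier sketches; the batch size and learning rate are illustrative:

```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, batch_size=32, lr=0.05, n_epochs=20, seed=0):
    """Mini-batch SGD: average the per-example gradients over a random subset S_k."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        perm = rng.permutation(m)
        for start in range(0, m, batch_size):
            S_k = perm[start:start + batch_size]                    # random subset of [n]
            grads = np.array([grad_fn(w, X[i], y[i]) for i in S_k])
            w -= lr * grads.mean(axis=0)                            # (1/|S_k|) * sum of gradients
    return w
```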

Example: Regularized Logistic Regression Optimization problem: $\min_\beta \frac{1}{n} \sum_i \left( -y_i x_i^\top \beta + \log(1 + e^{x_i^\top \beta}) \right) + \frac{\lambda}{2} \|\beta\|_2^2$ Gradient computation: $\nabla f(\beta) = \frac{1}{n} \sum_i (p_i(\beta) - y_i)\, x_i + \lambda \beta$, where $p_i(\beta) = 1/(1 + e^{-x_i^\top \beta})$ Update costs: Batch: O(nd) Stochastic: O(d) Mini-batch: O(|S_k| d) Batch is doable if n is moderate but not when n is huge
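For this example, the per-example and full-batch gradients below follow the formulas as reconstructed above; this is a sketch, and the variable names (beta, lam) are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_single(beta, x, y, lam):
    """Stochastic gradient at one example: O(d) work per update."""
    p = sigmoid(x @ beta)                            # p_i(beta)
    return (p - y) * x + lam * beta

def logistic_grad_batch(beta, X, y, lam):
    """Full-batch gradient: O(nd) work per update."""
    p = sigmoid(X @ beta)
    return X.T @ (p - y) / X.shape[0] + lam * beta
```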

Example: n=10,000, d=20 Iterations make better progress as the mini-batch size grows, but each iteration also takes more computation time http://stat.cmu.edu/~ryantibs/convexopt/lectures/25-fast-stochastic.pdf

SGD Updates for Various Systems Bottou, L. (2012). Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade.

Asymptotic Analysis of GD and SGD Bottou, L. (2012). Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade.

SGD Recommendations Randomly shuffle training examples Although theory says you should randomly pick examples, it is easier to make a pass through your training set sequentially Shuffling before each iteration eliminates the effect of order Monitor both training cost and validation error Set aside samples for a decent validation set Compute the objective on the training set and validation set (expensive but better than overfitting or wasting computation) Bottou, L. (2012). Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade.

SGD Recommendations (2) Check the gradient using finite differences If the gradient computation is slightly incorrect, it can yield an erratic and slowly converging algorithm Verify your code by slightly perturbing the parameter and inspecting the differences between the two gradients (a finite-difference check is sketched below) Experiment with the learning rates using a small sample of the training set SGD convergence rates are independent of the sample size Use traditional optimization algorithms as a reference point
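One way to implement the finite-difference check described above is a central-difference comparison, coordinate by coordinate; this is a sketch, and the epsilon and relative-error convention are conventional choices rather than values from the slides:

```python
import numpy as np

def gradient_check(loss_fn, grad_fn, w, eps=1e-6):
    """Compare an analytic gradient against central finite differences at the point w."""
    analytic = grad_fn(w)
    numeric = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        numeric[j] = (loss_fn(w + e) - loss_fn(w - e)) / (2 * eps)
    # A relative error around 1e-6 or below suggests the analytic gradient is correct
    return np.max(np.abs(analytic - numeric) / (np.abs(analytic) + np.abs(numeric) + 1e-12))
```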

SGD Recommendations (3) Leverage the sparsity of the training examples For very high-dimensional vectors with few nonzero coefficients, you only need to update the weight coefficients corresponding to the nonzero pattern in x Use learning rates of the form $\gamma_t = \gamma_0 (1 + \gamma_0 \lambda t)^{-1}$ This allows you to start from a reasonable learning rate determined by testing on a small sample It works well in most situations if the initial learning rate $\gamma_0$ is chosen slightly smaller than the best value observed on the training sample
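The recommended schedule can be written as a one-line helper; here lam stands for the regularization constant in the schedule (an assumption on my part, following Bottou's formula), and gamma0 would be calibrated on a small subsample as described above:

```python
def bottou_lr(t, gamma0, lam):
    """Learning rate gamma_t = gamma0 * (1 + gamma0 * lam * t)**(-1)."""
    return gamma0 / (1.0 + gamma0 * lam * t)

# Example: the decay over the first few updates with gamma0 = 0.5 and lam = 0.01
print([round(bottou_lr(t, 0.5, 0.01), 4) for t in range(5)])
```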

Some Resources for SGD Francis Bach's talk in 2012: http://www.ann.jussieu.fr/~plc/bach2012.pdf Stochastic Gradient Methods Workshop: http://yaroslavvb.blogspot.com/2014/03/stochastic-gradientmethods-2014.html Python implementation in scikit-learn: http://scikit-learn.org/stable/modules/sgd.html IPython notebook for implementing GD and SGD in Python: https://github.com/dtnewman/gradient_descent/blob/master/stochastic_gradient_descent.ipynb