Advanced computational methods X071521 - Selected Topics: SGD

In this lecture, we look at the stochastic gradient descent (SGD) method.

1 An illustrative example

MNIST is a simple dataset of handwritten digit images; you can find it on Yann LeCun's website. Each MNIST datum is a pair $(x, y)$, where $x$ is the image (an array of pixels) and $y$ is the label giving the true digit that the image represents. In machine learning, algorithms that make use of the labels are called supervised learning, and supervised learning is our focus in this lecture. It proceeds as follows:

- We have $N \gg 1$ samples with known labels, called the training set.
- We build a model that relates $x$ to the label; the model is required to reach sufficient accuracy on the training set.
- Given a new image $x$ whose label we do not know (it may come from a test set or from the real world), we use the model to predict its label. (If the label becomes known after we predict, we may choose to update the model; this type of learning is called online learning.)

The number of parameters in the model affects its predictive behavior. With too few parameters, the model is not expressive enough. With too many degrees of freedom, the parameters can be adjusted so that the accuracy on the training set is very high, but such models often behave poorly on new data; we say the model does not generalize well. This is known as overfitting.

1.1 The softmax model

A very basic regression model is softmax regression, which we now apply to MNIST. Define

$\rho_i = \sum_j w_{ij} x_j + b_i.$

Here $\{x_j\}$ is the vector obtained by reshaping the image into a vector, and $b_i$ is a bias parameter. We then define the probability that the image shows digit $i$ by

$p_i = \mathrm{softmax}(\rho)_i = \frac{\exp(\rho_i)}{\sum_j \exp(\rho_j)}.$

Now we need to train the model, i.e., find the parameters $w_{ij}$. We use an indicator, called the loss, to measure how well the model behaves. Here we use the relative entropy introduced in the SDE part of the course:

$H(y \,\|\, y') = \sum_{i=0}^{9} y_i \log(y_i / y'_i) = -\sum_{i=0}^{9} y_i \log y'_i,$

where $y'$ is the predicted probability vector and $y$ is the true probability vector ($y_j = 1$ if the image is digit $j$; the second equality holds because $y$ is such a one-hot vector). The loss function associated with the training set is

$L(w) = \frac{1}{N} \sum_{n=1}^{N} H(y^{(n)} \,\|\, y'^{(n)}).$

Hence, we aim to solve the optimization problem

$w^* = \operatorname*{argmin}_w \frac{1}{N} \sum_{n=1}^{N} H(y^{(n)} \,\|\, y'^{(n)}).$

The loss function is often highly nonlinear, so many algorithms cannot be used. The most frequently used one is the gradient descent (GD) method

$w_{n+1} = w_n - \eta \nabla L(w_n),$

where $\eta$ is called the learning rate. There are many online tutorials on how to use TensorFlow to train on MNIST; if you are interested, you can read them.
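As a concrete illustration of this pipeline (softmax probabilities, cross-entropy loss, gradient descent), here is a minimal NumPy sketch. It trains on randomly generated data standing in for MNIST, so the array shapes, learning rate, and iteration count are illustrative choices rather than anything specified in the lecture.

```python
import numpy as np

def softmax(rho):
    # Subtract the row-wise maximum before exponentiating, for numerical stability.
    z = np.exp(rho - rho.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def loss_and_grads(W, b, X, Y):
    """Cross-entropy loss L(w) and its gradients for softmax regression.
    X: (N, d) flattened images; Y: (N, 10) one-hot labels."""
    N = X.shape[0]
    P = softmax(X @ W + b)                       # predicted probabilities y'
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    G = (P - Y) / N                              # gradient of the loss w.r.t. rho
    return loss, X.T @ G, G.sum(axis=0)

# Random data standing in for MNIST, just to exercise the code.
rng = np.random.default_rng(0)
N, d, C = 1000, 784, 10
X = rng.normal(size=(N, d))
Y = np.eye(C)[rng.integers(0, C, size=N)]

W, b, eta = np.zeros((d, C)), np.zeros(C), 0.5   # eta is the learning rate
for step in range(200):                          # full-batch gradient descent
    loss, gW, gb = loss_and_grads(W, b, X, Y)
    W -= eta * gW
    b -= eta * gb
print("final training loss:", loss)
```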

1.2 Neural networks and the issues of GD

The softmax model is a very simple model. In recent years, artificial neural networks have been used to approximate various practical models. The idea is to combine linear combinations with composition.

Let $K$ be the so-called activation function, which can be the softmax function or, more often, the sigmoid function. Consider some inputs $\{g_i(x)\}_{i=1}^{N_p}$. The output is then

$f(x) = K\Big(\sum_{i=1}^{N_p} w_i g_i(x)\Big).$

Such a function or structure is called a layer of the neural network. Of course, $f$ can be used as a new input: one constructs several such $f_i$'s and then applies $K$ again to generate new outputs. With several such layers, the whole model is called a deep neural network, and the final output is regarded as the probability. Clearly, a deep neural network has many parameters, and computing the derivatives of the loss function with respect to the parameters is challenging; the standard algorithm for this is back propagation. The issues with GD are as follows:

1. When the network is deep enough, the loss function may have many local minimizers, and GD is easily trapped at them. In particular, sharp local minimizers (where the graph is steep around the minimizer) are regarded as bad, because they are believed to have poor generalization behavior.

2. When the number of samples is large ($N \gg 1$), computing the full gradient is very expensive.

2 The SGD

From here on, we use $x$ to denote the parameters $w$. The idea of stochastic gradient descent (SGD) is that at each step we pick $m$ samples at random. Let $M$ be the set of chosen indices. We create the stochastic loss function

$L = \frac{1}{m} \sum_{j \in M} L_j,$

where $m$ is called the batch size of SGD. With this random loss function, SGD reads

$X_{n+1} = X_n - \eta \, \frac{1}{m} \sum_{j \in M} \nabla L_j(X_n).$
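A minimal NumPy sketch of this update is given below. The per-sample gradient function, the least-squares example, and all hyperparameters (batch size, learning rate, iteration count) are illustrative assumptions; batches are drawn with replacement for simplicity.

```python
import numpy as np

def sgd(grad_L_j, x0, N, eta=0.1, m=32, n_iter=2000, seed=0):
    """Minimal mini-batch SGD: x <- x - eta * (1/m) * sum_{j in M} grad L_j(x).

    grad_L_j(x, j) returns the gradient of the j-th per-sample loss at x.
    The batch M is drawn with replacement, for simplicity."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        batch = rng.integers(0, N, size=m)             # random index set M
        g = np.mean([grad_L_j(x, j) for j in batch], axis=0)
        x -= eta * g
    return x

# Example: least squares with per-sample losses L_j(x) = 0.5 * (a_j . x - b_j)^2.
rng = np.random.default_rng(1)
A = rng.normal(size=(500, 5))
x_true = rng.normal(size=5)
b = A @ x_true
grad = lambda x, j: (A[j] @ x - b[j]) * A[j]
print("SGD estimate:", sgd(grad, np.zeros(5), N=500))
print("true x:      ", x_true)
```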

Clearly, SGD alleviates both of the above issues to some extent. First, it introduces randomness, which helps the iterates escape from sharp local minimizers. Second, the computational cost per step is greatly reduced.

2.1 Diffusion approximations

If we fix $m$ as a constant, then $\{X_n\}$ forms a time-homogeneous Markov chain. We want to show that $\{X_n\}$ is a weak scheme for some SDE. Since an SDE (Itô equation) generates a diffusion process, we call the corresponding SDE the diffusion approximation of SGD. For related references, read "Stochastic modified equations and adaptive stochastic gradient algorithms" (Li, Tai, E) and "Semigroups of stochastic gradient descent and online principal component analysis: properties and diffusion approximations".

Recall the operator we have defined,

$(S_n \varphi)(x) = \mathbb{E}\big(\varphi(X_n) \mid X_0 = x\big).$

We have shown that $\{S_n\}$ forms a semigroup, so it suffices to analyze one step:

$(S\varphi)(x) = \mathbb{E}\,\varphi\Big(x - \eta \frac{1}{m} \sum_{j \in M} \nabla f_j(x)\Big),$

where we now write $f_j$ for the per-sample loss $L_j$. For simplicity, we focus on $m = 1$ and define

$f(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x),$

which is clearly the (full) loss function. We have

$(S\varphi)(x) = \mathbb{E}\,\varphi\big(x - \eta \nabla f_j(x)\big) = \frac{1}{N} \sum_{n=1}^{N} \varphi\big(x - \eta \nabla f_n(x)\big).$

A direct Taylor expansion gives

$(S\varphi)(x) = \varphi(x) - \eta \nabla f(x) \cdot \nabla \varphi(x) + \frac{\eta^2}{2N} \sum_{n=1}^{N} \nabla f_n \nabla f_n^T : \nabla^2 \varphi(x) + O(\eta^3).$

Suppose there is some corresponding SDE:

$dX = b\, dt + \sigma\, dW.$

As we have done before, we perform the semigroup expansion

$e^{\eta \mathcal{L}} \varphi(x) = \varphi(x) + \eta \mathcal{L} \varphi(x) + \frac{1}{2}\eta^2 \mathcal{L}^2 \varphi(x) + O(\eta^3).$

Clearly, for a first order weak approximation we only need to require

$\mathcal{L} = -\nabla f(x) \cdot \nabla,$

which is just the deterministic ODE

$\dot{X} = -\nabla f(X).$

To get a second order approximation, we must allow $\mathcal{L}$ to depend on $\eta$; this is similar in spirit to the idea of modified equations discussed before. For the next order, let us try

$\mathcal{L} = \big(-\nabla f(x) + \eta b_1\big) \cdot \nabla + \frac{\eta}{2} \Sigma : \nabla^2.$

A detailed computation (matching the Taylor expansion above term by term) gives

$b_1 = -\frac{1}{4} \nabla |\nabla f(x)|^2, \qquad \Sigma = \frac{1}{N} \sum_{k=1}^{N} \big(\nabla f(x) - \nabla f_k(x)\big)\big(\nabla f(x) - \nabla f_k(x)\big)^T = \mathrm{Var}\big(\nabla f_n(x)\big).$

This corresponds to the SDE

$dX = -\Big(\nabla f + \frac{\eta}{4} \nabla |\nabla f|^2\Big) dt + \sqrt{\eta}\, \Sigma^{1/2}\, dW.$

Remark 1. The diffusion approximation is only valid on a fixed time interval. The term $\frac{\eta}{4} \nabla |\nabla f|^2$ arises because the forward Euler scheme for the ODE has only first order accuracy; to reach second order we must correct the leading term, as in the modified equation. The term $\sqrt{\eta}\, \Sigma^{1/2}\, dW$ is the crucial part: it captures the dominant fluctuations.
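To see the approximation in action, here is a small numerical experiment on a toy one-dimensional problem with per-sample losses $f_n(x) = (x - a_n)^2/2$, for which $\nabla f$, the correction term, and $\Sigma$ are all explicit (they are worked out in the comments). It compares the mean and spread of SGD at a fixed time with an Euler-Maruyama discretization of the modified SDE derived above. The data, step size, and horizon are illustrative choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: f_n(x) = (x - a_n)^2 / 2, so f(x) = mean_n f_n(x),
# f'(x) = x - mean(a), and Sigma(x) = Var_n(f_n'(x)) = Var(a), a constant.
N = 50
a = rng.normal(size=N)
a_bar, a_var = a.mean(), a.var()

eta, T, n_paths = 0.1, 2.0, 20000
n_steps = int(T / eta)

# SGD with batch size m = 1.
x_sgd = np.full(n_paths, 2.0)
for _ in range(n_steps):
    idx = rng.integers(0, N, size=n_paths)
    x_sgd -= eta * (x_sgd - a[idx])

# Euler-Maruyama (step eta) for the modified SDE
#   dX = -(f'(X) + (eta/4) d/dx |f'(X)|^2) dt + sqrt(eta * Sigma) dW,
# which for this quadratic has drift -(1 + eta/2) * (X - a_bar).
x_sde = np.full(n_paths, 2.0)
for _ in range(n_steps):
    drift = -(1.0 + eta / 2.0) * (x_sde - a_bar)
    x_sde += eta * drift + np.sqrt(eta) * np.sqrt(eta * a_var) * rng.normal(size=n_paths)

print("SGD at time T: mean %.4f  std %.4f" % (x_sgd.mean(), x_sgd.std()))
print("SDE at time T: mean %.4f  std %.4f" % (x_sde.mean(), x_sde.std()))
```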

2.2 Using the diffusion approximation to understand SGD

For related references, read "Stochastic modified equations and adaptive stochastic gradient algorithms" and "On the diffusion approximation of nonconvex stochastic gradient descent". Though the diffusion approximation is only valid on a fixed time interval, we can still use it to understand several behaviors of SGD.

Cooling down and settling down. Clearly, we hope to arrive at flat minimizers, and once there we do not want to escape again. The idea is to decrease the learning rate $\eta$ (much like simulated annealing). A long-used strategy is to set $\eta \sim 1/n$. To justify this, Li, Tai and E considered in their paper the controlled diffusion approximation

$dX = -u_t \nabla f(X)\, dt + u_t \sqrt{\eta}\, \sigma\, dW.$

The objective is to optimize $\min_u \mathbb{E} f(X_T)$, which corresponds to a Hamilton-Jacobi-Bellman (HJB) equation. In the case $f = \frac{1}{2} a (x - b)^2$, this HJB equation can be solved analytically, and from there they can justify the rate $1/n$ in some sense. (If you are interested, you can read the original paper.)

Escaping saddle points. Consider the SDE

$dX = -\nabla f(X)\, dt + \sqrt{\varepsilon}\, \sigma\, dW.$

In a paper of Kifer, it is shown that the time the SDE takes to escape a saddle point is at most $O(\log \varepsilon^{-1})$. Hence, we expect SGD to escape a saddle point in roughly $O(\eta^{-1} \log \eta^{-1})$ iterations. Current results prove that the number of iterations needed is of order $O(\eta^{-2} |\log \eta|)$; proving the sharp bound seems challenging.

Behavior near a local minimum. For the same SDE

$dX = -\nabla f(X)\, dt + \sqrt{\varepsilon}\, \sigma\, dW,$

large deviation theory says that the time needed to escape from a minimum behaves like $O(\exp(C/\varepsilon))$ as $\varepsilon \to 0$. Hence, if the noise is very low, the system will be trapped at a local minimum; we should keep the noise level high enough if we want to escape. For moderate $\varepsilon$, the escape time is related to the Hessian of $f$ as well (see a standard reference on large deviations). If $\nabla^2 f$ has small eigenvalues, it is hard to increase the function value by a given threshold; on the contrary, if the Hessian is large, escaping is relatively easy.
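The exponential escape time from a local minimum can be seen in a tiny simulation. The sketch below runs Euler-Maruyama for $dX = -f'(X)\,dt + \sqrt{\varepsilon}\,dW$ (taking $\sigma = 1$) in a double-well potential and records the mean time to hop from the left well to the right one for a few noise levels; the potential, discretization step, and noise levels are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Double-well potential f(x) = (x^2 - 1)^2 / 4, with minima at x = -1, +1
# separated by a barrier at x = 0; f'(x) = x * (x^2 - 1).
grad_f = lambda x: x * (x**2 - 1.0)

def mean_escape_time(eps, n_paths=300, dt=1e-2, t_max=2000.0):
    """Euler-Maruyama for dX = -f'(X) dt + sqrt(eps) dW, started in the left well
    at x = -1; 'escape' = first time the path reaches the right well, x > 0.5."""
    n_steps = int(t_max / dt)
    x = np.full(n_paths, -1.0)
    t_escape = np.full(n_paths, np.inf)
    for k in range(1, n_steps + 1):
        x += -grad_f(x) * dt + np.sqrt(eps * dt) * rng.normal(size=n_paths)
        crossed = (x > 0.5) & np.isinf(t_escape)
        t_escape[crossed] = k * dt
        if not np.isinf(t_escape).any():
            break
    return np.where(np.isinf(t_escape), t_max, t_escape).mean()

# Halving the noise level roughly squares the mean escape time (exp(C/eps) scaling).
for eps in (0.5, 0.25, 0.125):
    print("eps = %.3f   mean escape time ~ %.1f" % (eps, mean_escape_time(eps)))
```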

Understanding the effects of batch size. In many machine learning papers (such as the one by Keskar et al.), numerical experiments suggest that using small batch sizes in the early stage yields better generalization. In "On the diffusion approximation of nonconvex stochastic gradient descent", an attempt is made to justify this. Intuitively, a small batch size leads to a higher noise level, so that the system can escape from sharp local minimizers.

2.3 Long time behaviors

Convergence to the minimizer in the strongly convex case. If the objective functions $f_n(x)$ are all strongly convex, it can be rigorously proved, using the martingale convergence theorem, that choosing $\eta_n = 1/n$ makes SGD converge to the minimizer (see the lecture slides by Powell, 2012). We attach the proof below for convenience.

Remark 2. Suppose that $f(x)$ is convex with a unique minimizer $x^*$. Consider the SGD iteration

$X_{n+1} = X_n - \eta_n \nabla f(X_n; \xi_n).$

We assume

$\|\nabla f(x; \xi)\| \le B < \infty, \qquad \sum_n \eta_n = \infty, \qquad \sum_n \eta_n^2 < \infty$

(so the step size is not constant). Here, we show that $X_n \to x^*$ a.s.

The idea is to consider

$Y_n = |X_n - x^*|^2 + B^2 \sum_{k=n}^{\infty} \eta_k^2.$

It is easily verified that $Y_n$ is a non-negative supermartingale. Indeed,

$Y_{n+1} - Y_n = -2\eta_n \nabla f(X_n; \xi_n) \cdot (X_n - x^*) + \eta_n^2 |\nabla f(X_n; \xi_n)|^2 - B^2 \eta_n^2 \le -2\eta_n \nabla f(X_n; \xi_n) \cdot (X_n - x^*),$

and the conditional expectation of the right-hand side given $\mathcal{F}_n$ is non-positive, since

$\mathbb{E}\big(\nabla f(X_n; \xi_n) \cdot (X_n - x^*) \mid \mathcal{F}_n\big) = \nabla f(X_n) \cdot (X_n - x^*) \ge 0.$

Then $Y_n \to Y_\infty$ a.s., with $\mathbb{E} Y_\infty \le \mathbb{E} Y_0$. The above inequality also implies

$Y_n \le Y_0 - 2 \sum_{k=0}^{n-1} \eta_k \nabla f(X_k; \xi_k) \cdot (X_k - x^*).$

Hence, we find

$2 \sum_{k=0}^{n-1} \eta_k\, \mathbb{E}\big(\nabla f(X_k; \xi_k) \cdot (X_k - x^*)\big) \le \mathbb{E} Y_0 - \mathbb{E} Y_n \le \mathbb{E} Y_0.$

Each term on the left-hand side is non-negative, so the sum converges. (Note that we do not know whether $\mathbb{E} Y_\infty = \lim_n \mathbb{E} Y_n$ or not.) Since $\sum_n \eta_n = \infty$, we conclude that

$\liminf_{k \to \infty} \mathbb{E}\big(\nabla f(X_k; \xi_k) \cdot (X_k - x^*)\big) = 0.$

If we know $f$ is strongly convex (with parameter $m$), we immediately have

$\mathbb{E}\big(\nabla f(X_k; \xi_k) \cdot (X_k - x^*)\big) = \mathbb{E}\big(\nabla f(X_k) \cdot (X_k - x^*)\big) \ge m\, \mathbb{E}|X_k - x^*|^2 \quad (\text{since } \nabla f(x^*) = 0),$

and the claim follows. Without strong convexity, use instead $\nabla f(X_k) \cdot (X_k - x^*) \ge f(X_k) - f(x^*)$. Then, along a subsequence, $f(X_{n_k}) \to f(x^*)$ a.s., implying $X_{n_k} \to x^*$ a.s. This shows that $Y_\infty = 0$ a.s., and since $|X_n - x^*|^2 \le Y_n \to Y_\infty = 0$, we conclude that $X_n \to x^*$ a.s.
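A quick numerical check of this statement, on the toy strongly convex problem $f_n(x) = \frac{1}{2}|x - a_n|^2$ (whose minimizer is the sample mean), is sketched below; the data, dimensions, and iteration counts are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Strongly convex toy problem: f_n(x) = 0.5 * |x - a_n|^2, whose minimizer
# x* is the sample mean of the a_n; grad f_n(x) = x - a_n.
N, d = 200, 3
a = rng.normal(size=(N, d))
x_star = a.mean(axis=0)

x = np.zeros(d)
for n in range(1, 200001):
    eta_n = 1.0 / n                     # sum eta_n = inf, sum eta_n^2 < inf
    j = rng.integers(0, N)
    x -= eta_n * (x - a[j])             # SGD step with decreasing learning rate
    if n in (10, 100, 1000, 10000, 100000, 200000):
        print("n = %6d   |X_n - x*| = %.4f" % (n, np.linalg.norm(x - x_star)))
```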

Asymptotic behavior for a constant learning rate. If the learning rate is constant, the detailed behavior for strongly convex target functions can be analyzed using Markov chain theory ("Bridging the gap between constant step size stochastic gradient descent and Markov chains"). SGD is shown to have a unique invariant measure when $\eta$ is small enough, and some asymptotic expansions can be obtained.

Nonasymptotic analysis. For interested readers, we mention the paper "Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis".

Remark 3. Another question is to justify the diffusion approximation over long times. This is possible when the objective functions are all strongly convex; in fact, we are making progress on this.

3 Other possible approaches

- Stochastic coordinate descent (a minimal sketch follows below)
- The coarse gradient proposed by Jack Xin
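For concreteness, here is a minimal sketch of randomized coordinate descent on a quadratic objective, with exact minimization along the chosen coordinate at each step. The quadratic and all sizes are illustrative choices, and this is only meant to show the basic structure of the method, not any specific variant from the literature.

```python
import numpy as np

rng = np.random.default_rng(4)

# Quadratic objective f(x) = 0.5 * x^T A x - b^T x with A symmetric positive
# definite; the i-th partial derivative of f is A[i] @ x - b[i].
d = 10
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)
b = rng.normal(size=d)
x_star = np.linalg.solve(A, b)

x = np.zeros(d)
for k in range(5000):
    i = rng.integers(0, d)               # pick one coordinate uniformly at random
    g_i = A[i] @ x - b[i]                # partial derivative along coordinate i
    x[i] -= g_i / A[i, i]                # exact minimization along that coordinate
print("error after 5000 random coordinate steps:", np.linalg.norm(x - x_star))
```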