Advanced computational methods X071521 - Selected Topics: SGD

In this lecture, we look at the stochastic gradient descent (SGD) method.

1 An illustrative example

MNIST is a simple dataset of handwritten digit images; you can find it on Yann LeCun's website. Each MNIST datum is a pair $(x, y)$, where $x$ is the image (an array of pixels) and $y$ is the label giving the true digit that the image represents. In machine learning, algorithms that make use of the labels are called supervised learning, and supervised learning is our focus in this lecture. It proceeds as follows:

- We have $N \gg 1$ samples with known labels, called the training set.
- We build a model that relates $x$ to the label; the model is required to reach sufficient accuracy on the training set.
- Given a new image $x$ whose label we do not know (it may come from a test set or from the real world), we use the model to predict its label. (If the label becomes known after we predict, we may choose to update the model; this type of learning is called online learning.)

The number of parameters in the model affects its predictive behavior. With too few parameters, the model is not expressive enough. With too many degrees of freedom, the parameters can be adjusted so that the accuracy on the training set is very high, but such models often behave poorly on new data; we say the model does not generalize well. This is known as overfitting.

1.1 The softmax model

A very basic regression model is softmax regression, which we now apply to MNIST. Define

$\rho_i = \sum_j w_{ij} x_j + b_i.$

Here $\{x_j\}$ is the vector obtained by reshaping the image into a vector, and $b_i$ is a bias parameter. We then define the probability that the image shows digit $i$ by

$p_i = \mathrm{softmax}(\rho)_i = \frac{\exp(\rho_i)}{\sum_j \exp(\rho_j)}.$

Now we need to train the model, i.e., find the parameters $w_{ij}$. We use an indicator, called the loss, to measure how well the model behaves. Here we use the relative entropy introduced in the SDE part of the course:

$H(y \,\|\, y') = \sum_{i=0}^{9} y_i \log(y_i / y'_i) = -\sum_{i=0}^{9} y_i \log y'_i,$

where $y'$ is the predicted probability vector and $y$ is the true probability vector ($y_j = 1$ if the image is digit $j$; the second equality holds because $y$ is such a one-hot vector). The loss function associated with the training set is

$L(w) = \frac{1}{N} \sum_{n=1}^{N} H(y^{(n)} \,\|\, y'^{(n)}).$

Hence, we aim to solve the optimization problem

$w^* = \operatorname*{argmin}_w \frac{1}{N} \sum_{n=1}^{N} H(y^{(n)} \,\|\, y'^{(n)}).$

The loss function is often highly nonlinear, so many algorithms cannot be used. The most frequently used one is the gradient descent (GD) method

$w_{n+1} = w_n - \eta \nabla L(w_n),$

where $\eta$ is called the learning rate. There are many online tutorials on how to use TensorFlow to train on MNIST; if you are interested, you can read them.
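As a concrete illustration of this pipeline (softmax probabilities, cross-entropy loss, gradient descent), here is a minimal NumPy sketch. It trains on randomly generated data standing in for MNIST, so the array shapes, learning rate, and iteration count are illustrative choices rather than anything specified in the lecture.

```python
import numpy as np

def softmax(rho):
    # Subtract the row-wise maximum before exponentiating, for numerical stability.
    z = np.exp(rho - rho.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def loss_and_grads(W, b, X, Y):
    """Cross-entropy loss L(w) and its gradients for softmax regression.
    X: (N, d) flattened images; Y: (N, 10) one-hot labels."""
    N = X.shape[0]
    P = softmax(X @ W + b)                       # predicted probabilities y'
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    G = (P - Y) / N                              # gradient of the loss w.r.t. rho
    return loss, X.T @ G, G.sum(axis=0)

# Random data standing in for MNIST, just to exercise the code.
rng = np.random.default_rng(0)
N, d, C = 1000, 784, 10
X = rng.normal(size=(N, d))
Y = np.eye(C)[rng.integers(0, C, size=N)]

W, b, eta = np.zeros((d, C)), np.zeros(C), 0.5   # eta is the learning rate
for step in range(200):                          # full-batch gradient descent
    loss, gW, gb = loss_and_grads(W, b, X, Y)
    W -= eta * gW
    b -= eta * gb
print("final training loss:", loss)
```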

1.2 Neural networks and the issues of GD

The softmax model is a very simple model. In recent years, artificial neural networks have been used to approximate various practical models. The idea is to combine linear combinations with composition.

Let $K$ be the so-called activation function, which can be the softmax function or, more often, the sigmoid function. Consider some inputs $\{g_i(x)\}_{i=1}^{N_p}$. The output is then

$f(x) = K\Big(\sum_{i=1}^{N_p} w_i g_i(x)\Big).$

Such a function or structure is called a layer of the neural network. Of course, $f$ can be used as a new input: one constructs several such $f_i$'s and then applies $K$ again to generate new outputs. With several such layers, the whole model is called a deep neural network, and the final output is regarded as the probability. Clearly, a deep neural network has many parameters, and computing the derivatives of the loss function with respect to the parameters is challenging; the standard algorithm for this is back propagation. The issues with GD are as follows:

1. When the network is deep enough, the loss function may have many local minimizers, and GD is easily trapped at them. In particular, sharp local minimizers (where the graph is steep around the minimizer) are regarded as bad, because they are believed to have poor generalization behavior.

2. When the number of samples is large ($N \gg 1$), computing the full gradient is very expensive.

2 The SGD

From here on, we use $x$ to denote the parameters $w$. The idea of stochastic gradient descent (SGD) is that at each step we pick $m$ samples at random. Let $M$ be the set of chosen indices. We create the stochastic loss function

$L = \frac{1}{m} \sum_{j \in M} L_j,$

where $m$ is called the batch size of SGD. With this random loss function, SGD reads

$X_{n+1} = X_n - \eta \, \frac{1}{m} \sum_{j \in M} \nabla L_j(X_n).$
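A minimal NumPy sketch of this update is given below. The per-sample gradient function, the least-squares example, and all hyperparameters (batch size, learning rate, iteration count) are illustrative assumptions; batches are drawn with replacement for simplicity.

```python
import numpy as np

def sgd(grad_L_j, x0, N, eta=0.1, m=32, n_iter=2000, seed=0):
    """Minimal mini-batch SGD: x <- x - eta * (1/m) * sum_{j in M} grad L_j(x).

    grad_L_j(x, j) returns the gradient of the j-th per-sample loss at x.
    The batch M is drawn with replacement, for simplicity."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        batch = rng.integers(0, N, size=m)             # random index set M
        g = np.mean([grad_L_j(x, j) for j in batch], axis=0)
        x -= eta * g
    return x

# Example: least squares with per-sample losses L_j(x) = 0.5 * (a_j . x - b_j)^2.
rng = np.random.default_rng(1)
A = rng.normal(size=(500, 5))
x_true = rng.normal(size=5)
b = A @ x_true
grad = lambda x, j: (A[j] @ x - b[j]) * A[j]
print("SGD estimate:", sgd(grad, np.zeros(5), N=500))
print("true x:      ", x_true)
```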

Clearly, SGD alleviates both of the above issues to some extent. First, it introduces randomness, which helps the iterates escape from sharp local minimizers. Second, the computational cost per step is greatly reduced.

2.1 Diffusion approximations

If we fix $m$ as a constant, then $\{X_n\}$ forms a time-homogeneous Markov chain. We want to show that $\{X_n\}$ is a weak scheme for some SDE. Since an SDE (Itô equation) generates a diffusion process, we call the corresponding SDE the diffusion approximation of SGD. For related references, read "Stochastic modified equations and adaptive stochastic gradient algorithms" (Li, Tai, E) and "Semigroups of stochastic gradient descent and online principal component analysis: properties and diffusion approximations".

Recall the operator we have defined,

$(S_n \varphi)(x) = \mathbb{E}\big(\varphi(X_n) \mid X_0 = x\big).$

We have shown that $\{S_n\}$ forms a semigroup, so it suffices to analyze one step:

$(S\varphi)(x) = \mathbb{E}\,\varphi\Big(x - \eta \frac{1}{m} \sum_{j \in M} \nabla f_j(x)\Big),$

where we now write $f_j$ for the per-sample loss $L_j$. For simplicity, we focus on $m = 1$ and define

$f(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x),$

which is clearly the (full) loss function. We have

$(S\varphi)(x) = \mathbb{E}\,\varphi\big(x - \eta \nabla f_j(x)\big) = \frac{1}{N} \sum_{n=1}^{N} \varphi\big(x - \eta \nabla f_n(x)\big).$

A direct Taylor expansion gives

$(S\varphi)(x) = \varphi(x) - \eta \nabla f(x) \cdot \nabla \varphi(x) + \frac{\eta^2}{2N} \sum_{n=1}^{N} \nabla f_n \nabla f_n^T : \nabla^2 \varphi(x) + O(\eta^3).$

Suppose there is some corresponding SDE:

$dX = b\, dt + \sigma\, dW.$

As we have done before, we perform the semigroup expansion

$e^{\eta \mathcal{L}} \varphi(x) = \varphi(x) + \eta \mathcal{L} \varphi(x) + \frac{1}{2}\eta^2 \mathcal{L}^2 \varphi(x) + O(\eta^3).$

Clearly, for a first order weak approximation we only need to require

$\mathcal{L} = -\nabla f(x) \cdot \nabla,$

which is just the deterministic ODE

$\dot{X} = -\nabla f(X).$

To get a second order approximation, we must allow $\mathcal{L}$ to depend on $\eta$; this is similar in spirit to the idea of modified equations discussed before. For the next order, let us try

$\mathcal{L} = \big(-\nabla f(x) + \eta b_1\big) \cdot \nabla + \frac{\eta}{2} \Sigma : \nabla^2.$

A detailed computation (matching the Taylor expansion above term by term) gives

$b_1 = -\frac{1}{4} \nabla |\nabla f(x)|^2, \qquad \Sigma = \frac{1}{N} \sum_{k=1}^{N} \big(\nabla f(x) - \nabla f_k(x)\big)\big(\nabla f(x) - \nabla f_k(x)\big)^T = \mathrm{Var}\big(\nabla f_n(x)\big).$

This corresponds to the SDE

$dX = -\Big(\nabla f + \frac{\eta}{4} \nabla |\nabla f|^2\Big) dt + \sqrt{\eta}\, \Sigma^{1/2}\, dW.$

Remark 1. The diffusion approximation is only valid on a fixed time interval. The term $\frac{\eta}{4} \nabla |\nabla f|^2$ arises because the forward Euler scheme for the ODE has only first order accuracy; to reach second order we must correct the leading term, as in the modified equation. The term $\sqrt{\eta}\, \Sigma^{1/2}\, dW$ is the crucial part: it captures the dominant fluctuations.
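To see the approximation in action, here is a small numerical experiment on a toy one-dimensional problem with per-sample losses $f_n(x) = (x - a_n)^2/2$, for which $\nabla f$, the correction term, and $\Sigma$ are all explicit (they are worked out in the comments). It compares the mean and spread of SGD at a fixed time with an Euler-Maruyama discretization of the modified SDE derived above. The data, step size, and horizon are illustrative choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: f_n(x) = (x - a_n)^2 / 2, so f(x) = mean_n f_n(x),
# f'(x) = x - mean(a), and Sigma(x) = Var_n(f_n'(x)) = Var(a), a constant.
N = 50
a = rng.normal(size=N)
a_bar, a_var = a.mean(), a.var()

eta, T, n_paths = 0.1, 2.0, 20000
n_steps = int(T / eta)

# SGD with batch size m = 1.
x_sgd = np.full(n_paths, 2.0)
for _ in range(n_steps):
    idx = rng.integers(0, N, size=n_paths)
    x_sgd -= eta * (x_sgd - a[idx])

# Euler-Maruyama (step eta) for the modified SDE
#   dX = -(f'(X) + (eta/4) d/dx |f'(X)|^2) dt + sqrt(eta * Sigma) dW,
# which for this quadratic has drift -(1 + eta/2) * (X - a_bar).
x_sde = np.full(n_paths, 2.0)
for _ in range(n_steps):
    drift = -(1.0 + eta / 2.0) * (x_sde - a_bar)
    x_sde += eta * drift + np.sqrt(eta) * np.sqrt(eta * a_var) * rng.normal(size=n_paths)

print("SGD at time T: mean %.4f  std %.4f" % (x_sgd.mean(), x_sgd.std()))
print("SDE at time T: mean %.4f  std %.4f" % (x_sde.mean(), x_sde.std()))
```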

2.2 Using the diffusion approximation to understand SGD

For related references, read "Stochastic modified equations and adaptive stochastic gradient algorithms" and "On the diffusion approximation of nonconvex stochastic gradient descent". Though the diffusion approximation is only valid on a fixed time interval, we can still use it to understand several behaviors of SGD.

Cooling down and settling down. Clearly, we hope to arrive at flat minimizers, and once there we do not want to escape again. The idea is to decrease the learning rate $\eta$ (much like simulated annealing). A long-used strategy is to set $\eta \sim 1/n$. To justify this, Li, Tai and E considered in their paper the controlled diffusion approximation

$dX = -u_t \nabla f(X)\, dt + u_t \sqrt{\eta}\, \sigma\, dW.$

The objective is to optimize $\min_u \mathbb{E} f(X_T)$, which corresponds to a Hamilton-Jacobi-Bellman (HJB) equation. In the case $f = \frac{1}{2} a (x - b)^2$, this HJB equation can be solved analytically, and from there they can justify the rate $1/n$ in some sense. (If you are interested, you can read the original paper.)

Escaping saddle points. Consider the SDE

$dX = -\nabla f(X)\, dt + \sqrt{\varepsilon}\, \sigma\, dW.$

In a paper of Kifer, it is shown that the time the SDE takes to escape a saddle point is at most $O(\log \varepsilon^{-1})$. Hence, we expect SGD to escape a saddle point in roughly $O(\eta^{-1} \log \eta^{-1})$ iterations. Current results prove that the number of iterations needed is of order $O(\eta^{-2} |\log \eta|)$; proving the sharp bound seems challenging.

Behavior near a local minimum. For the same SDE

$dX = -\nabla f(X)\, dt + \sqrt{\varepsilon}\, \sigma\, dW,$

large deviation theory says that the time needed to escape from a minimum behaves like $O(\exp(C/\varepsilon))$ as $\varepsilon \to 0$. Hence, if the noise is very low, the system will be trapped at a local minimum; we should keep the noise level high enough if we want to escape. For moderate $\varepsilon$, the escape time is related to the Hessian of $f$ as well (see a standard reference on large deviations). If $\nabla^2 f$ has small eigenvalues, it is hard to increase the function value by a given threshold; on the contrary, if the Hessian is large, escaping is relatively easy.
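The exponential escape time from a local minimum can be seen in a tiny simulation. The sketch below runs Euler-Maruyama for $dX = -f'(X)\,dt + \sqrt{\varepsilon}\,dW$ (taking $\sigma = 1$) in a double-well potential and records the mean time to hop from the left well to the right one for a few noise levels; the potential, discretization step, and noise levels are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Double-well potential f(x) = (x^2 - 1)^2 / 4, with minima at x = -1, +1
# separated by a barrier at x = 0; f'(x) = x * (x^2 - 1).
grad_f = lambda x: x * (x**2 - 1.0)

def mean_escape_time(eps, n_paths=300, dt=1e-2, t_max=2000.0):
    """Euler-Maruyama for dX = -f'(X) dt + sqrt(eps) dW, started in the left well
    at x = -1; 'escape' = first time the path reaches the right well, x > 0.5."""
    n_steps = int(t_max / dt)
    x = np.full(n_paths, -1.0)
    t_escape = np.full(n_paths, np.inf)
    for k in range(1, n_steps + 1):
        x += -grad_f(x) * dt + np.sqrt(eps * dt) * rng.normal(size=n_paths)
        crossed = (x > 0.5) & np.isinf(t_escape)
        t_escape[crossed] = k * dt
        if not np.isinf(t_escape).any():
            break
    return np.where(np.isinf(t_escape), t_max, t_escape).mean()

# Halving the noise level roughly squares the mean escape time (exp(C/eps) scaling).
for eps in (0.5, 0.25, 0.125):
    print("eps = %.3f   mean escape time ~ %.1f" % (eps, mean_escape_time(eps)))
```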

Understanding the effects of batch size. In many machine learning papers (such as the one by Keskar et al.), numerical experiments suggest that using small batch sizes in the early stage yields better generalization. In "On the diffusion approximation of nonconvex stochastic gradient descent", an attempt is made to justify this. Intuitively, a small batch size leads to a higher noise level, so that the system can escape from sharp local minimizers.

2.3 Long time behaviors

Convergence to the minimizer in the strongly convex case. If the objective functions $f_n(x)$ are all strongly convex, it can be rigorously proved, using the martingale convergence theorem, that choosing $\eta_n = 1/n$ makes SGD converge to the minimizer (see the lecture slides by Powell, 2012). We attach the proof below for convenience.

Remark 2. Suppose that $f(x)$ is convex with a unique minimizer $x^*$. Consider the SGD iteration

$X_{n+1} = X_n - \eta_n \nabla f(X_n; \xi_n).$

We assume

$\|\nabla f(x; \xi)\| \le B < \infty, \qquad \sum_n \eta_n = \infty, \qquad \sum_n \eta_n^2 < \infty$

(so the step size is not constant). Here, we show that $X_n \to x^*$ a.s.

The idea is to consider

$Y_n = |X_n - x^*|^2 + B^2 \sum_{k=n}^{\infty} \eta_k^2.$

It is easily verified that $Y_n$ is a non-negative supermartingale. Indeed,

$Y_{n+1} - Y_n = -2\eta_n \nabla f(X_n; \xi_n) \cdot (X_n - x^*) + \eta_n^2 |\nabla f(X_n; \xi_n)|^2 - B^2 \eta_n^2 \le -2\eta_n \nabla f(X_n; \xi_n) \cdot (X_n - x^*),$

and the conditional expectation of the right-hand side given $\mathcal{F}_n$ is non-positive, since

$\mathbb{E}\big(\nabla f(X_n; \xi_n) \cdot (X_n - x^*) \mid \mathcal{F}_n\big) = \nabla f(X_n) \cdot (X_n - x^*) \ge 0.$

Then $Y_n \to Y_\infty$ a.s., with $\mathbb{E} Y_\infty \le \mathbb{E} Y_0$. The above inequality also implies

$Y_n \le Y_0 - 2 \sum_{k=0}^{n-1} \eta_k \nabla f(X_k; \xi_k) \cdot (X_k - x^*).$

Hence, we find

$2 \sum_{k=0}^{n-1} \eta_k\, \mathbb{E}\big(\nabla f(X_k; \xi_k) \cdot (X_k - x^*)\big) \le \mathbb{E} Y_0 - \mathbb{E} Y_n \le \mathbb{E} Y_0.$

Each term on the left-hand side is non-negative, so the sum converges. (Note that we do not know whether $\mathbb{E} Y_\infty = \lim_n \mathbb{E} Y_n$ or not.) Since $\sum_n \eta_n = \infty$, we conclude that

$\liminf_{k \to \infty} \mathbb{E}\big(\nabla f(X_k; \xi_k) \cdot (X_k - x^*)\big) = 0.$

If we know $f$ is strongly convex (with parameter $m$), we immediately have

$\mathbb{E}\big(\nabla f(X_k; \xi_k) \cdot (X_k - x^*)\big) = \mathbb{E}\big(\nabla f(X_k) \cdot (X_k - x^*)\big) \ge m\, \mathbb{E}|X_k - x^*|^2 \quad (\text{since } \nabla f(x^*) = 0),$

and the claim follows. Without strong convexity, use instead $\nabla f(X_k) \cdot (X_k - x^*) \ge f(X_k) - f(x^*)$. Then, along a subsequence, $f(X_{n_k}) \to f(x^*)$ a.s., implying $X_{n_k} \to x^*$ a.s. This shows that $Y_\infty = 0$ a.s., and since $|X_n - x^*|^2 \le Y_n \to Y_\infty = 0$, we conclude that $X_n \to x^*$ a.s.
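A quick numerical check of this statement, on the toy strongly convex problem $f_n(x) = \frac{1}{2}|x - a_n|^2$ (whose minimizer is the sample mean), is sketched below; the data, dimensions, and iteration counts are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Strongly convex toy problem: f_n(x) = 0.5 * |x - a_n|^2, whose minimizer
# x* is the sample mean of the a_n; grad f_n(x) = x - a_n.
N, d = 200, 3
a = rng.normal(size=(N, d))
x_star = a.mean(axis=0)

x = np.zeros(d)
for n in range(1, 200001):
    eta_n = 1.0 / n                     # sum eta_n = inf, sum eta_n^2 < inf
    j = rng.integers(0, N)
    x -= eta_n * (x - a[j])             # SGD step with decreasing learning rate
    if n in (10, 100, 1000, 10000, 100000, 200000):
        print("n = %6d   |X_n - x*| = %.4f" % (n, np.linalg.norm(x - x_star)))
```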

Asymptotic behavior for a constant learning rate. If the learning rate is constant, the detailed behavior for strongly convex target functions can be analyzed using Markov chain theory ("Bridging the gap between constant step size stochastic gradient descent and Markov chains"). SGD is shown to have a unique invariant measure when $\eta$ is small enough, and some asymptotic expansions can be obtained.

Nonasymptotic analysis. For interested readers, we mention the paper "Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis".

Remark 3. Another question is to justify the diffusion approximation over long times. This is possible when the objective functions are all strongly convex; in fact, we are making progress on this.

3 Other possible approaches

- Stochastic coordinate descent (a minimal sketch follows below)
- The coarse gradient proposed by Jack Xin
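For concreteness, here is a minimal sketch of randomized coordinate descent on a quadratic objective, with exact minimization along the chosen coordinate at each step. The quadratic and all sizes are illustrative choices, and this is only meant to show the basic structure of the method, not any specific variant from the literature.

```python
import numpy as np

rng = np.random.default_rng(4)

# Quadratic objective f(x) = 0.5 * x^T A x - b^T x with A symmetric positive
# definite; the i-th partial derivative of f is A[i] @ x - b[i].
d = 10
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)
b = rng.normal(size=d)
x_star = np.linalg.solve(A, b)

x = np.zeros(d)
for k in range(5000):
    i = rng.integers(0, d)               # pick one coordinate uniformly at random
    g_i = A[i] @ x - b[i]                # partial derivative along coordinate i
    x[i] -= g_i / A[i, i]                # exact minimization along that coordinate
print("error after 5000 random coordinate steps:", np.linalg.norm(x - x_star))
```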