arxiv: v2 [cs.lg] 16 Aug 2018

Size: px
Start display at page:

Download "arxiv: v2 [cs.lg] 16 Aug 2018"

Transcription

1 An Alternative View: When Does SGD Escape Local Minima? arxiv: v [cs.lg] 16 Aug 018 Robert Kleinberg Department of Computer Science Cornell University Yang Yuan Department of Computer Science Cornell University August 17, 018 Abstract Yuanzhi Li Department of Computer Science Princeton University Stochastic gradient descent (SGD) is widely used in machine learning. Although being commonly viewed as a fast but not accurate version of gradient descent (GD), it always finds better solutions than GD for modern neural networks. In order to understand this phenomenon, we take an alternative view that SGD is working on the convolved (thus smoothed) version of the loss function. We show that, even if the function f has many bad local minima or saddle points, as long as for every point, the weighted average of the gradients of its neighborhoods is one point conve with respect to the desired solution, SGD will get close to, and then stay around with constant probability. More specifically, SGD will not get stuck at sharp local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information. The neighborhood size is controlled by step size and gradient noise. Our result identifies a set of functions that SGD provably works, which is much larger than the set of conve functions. Empirically, we observe that the loss surface of neural networks enjoys nice one point conveity properties locally, therefore our theorem helps eplain why SGD works so well for neural networks. 1 Introduction Nowadays, stochastic gradient descent (SGD), as well as its variants (Adam [19], Momentum [8], Adagrad [6], etc.) have become the de facto algorithms for training neural networks. SGD runs iterative updates for the weights t : t+1 = t ηv t, where η is the step size 1. v t is the stochastic gradient that satisfies E[v t ] = f( t ), and is usually computed using a mini-batch of the dataset. In the regime of conve optimization, SGD is proved to be a nice tradeoff between accuracy and efficiency: it requires more iterations to converge, but fewer gradient evaluations per iteration. Therefore, for the standard empirical risk minimizing problems with n points and smoothness L, to get to ɛ-close to, GD needs O(Ln/ɛ) gradient evaluations [4], but SGD 1 In this paper, we use step size and learning rate interchangeably. 1

2 y t t η f( t ) ηω t ηv t Local minimum basin t+ 3 1 t+1 y t+1 Figure 1: SGD path t t+1 can be decomposed into t y t t+1. If the local minimum basin has small diameter, the gradient at t+1 will point away from the basin t y t t+1 y t Figure : 3D version of Figure 1: SGD could escape a local minimum within one step. 0 with reduced variance only needs O(n log 1 ɛ + L ɛ ) gradient evaluations [17, 4, 6, 1]. In these scenarios, noise is a by-product of cheap gradient computation, and does not help training. By contrast, for non-conve optimization problems like training neural networks, noise seems crucial. It is observed that with the help of noisy gradients, SGD does not only converge faster, but also converge to a better solution compared with GD [18]. To formally understand this phenomenon, people have analyzed the role of noise in various settings. For eample, it is proved that noise helps to escape saddle points [7, 16], gives better generalization [9, 3], and also guarantees polynomial hitting time of good local minima under some assumptions [9]. However, it is still unclear why SGD could converge to better local minima than GD. Empirically, in additional to the gradient noise, the step size is observed to be a key factor in optimization. More specifically, small step size helps refine the network and converge to a local minimum, while large step size helps escape the current local minimum and go towards a better one [15, 1]. Thus, standard training schedule for modern networks uses large step size first, and shrinks it later [10, 14]. While using large step sizes to escape local minima matches with intuition, the eisting analysis on SGD for non-conve objectives always considers the small-step-size settings [7, 16, 9, 9]. See Figure 1 for an illustration. Consider the scenario that for some t, instead of pointing to the solution (not shown), its negative gradient points to a bad local minimum, so following the full gradient we will arrive y t t η f( t ). Fortunately, since we are running SGD, the actual direction we take is ηv t = η( f( t ) + ω t ), where ω t is the noise with E[ω t ] = 0, ω t W ( t ). As we show in Figure 1, if we take a large η, we may get out of the basin region with the help of noise, i.e., from y t to t+1. Here, getting out of the basin means the negative gradient at t+1 no longer points to (See also Figure ). To formalize this intuition, instead of analyzing the sequence t t+1, let us look at the sequence y t y t+1, where y t is defined to be t η f( t ), as in the preceding paragraph. The SGD algorithm never computes these vectors y t, but we are only using them as an analysis tool. W ( t) is data dependent.

3 From the equation t+1 = y t ηω t we obtain the following update rule relating y t+1 to y t. y t+1 = y t ηω t η f(y t ηω t ) (1) The random vector ηω t in (1) has epectation 0, so if we take the epectation of both sides of (1), we get E ωt [y t+1 ] = y t η E ωt [f(y t ηω t )]. Therefore, if we define g t to be the function g t (y) = E ωt [f(y ηω t )], which is simply the original function f convolved with the η-scaled gradient noise, then the sequence y t is approimately doing gradient descent on the sequence of functions (g t ). This alternative view helps to eplain why SGD converges to a good local minimum, even when f has many other sharp local minima. Intuitively, sharp local minima are eliminated by the convolution operator that transforms f to g t, since convolution has the effect of smoothing out short-range fluctuations. This reasoning ensures that SGD converges to a good local minimum under much weaker conditions, because instead of imposing conveity or one-point conveity requirements on f itself, we only require those properties to hold for the smoothed functions obtained from f by convolution. We can formalize the foregoing argument using the following assumption. Assumption 1 (Main Assumption). For a fied point 3, noise distribution W (), step size η, the function f is c-one point strongly conve with respect to after convolved with noise. That is, for any, y in domain D s.t. y = η f(), E ω W () f(y ηω), y c y () For point y, since the direction y points to, by having positive inner product with y, we know the direction η f(y t ηω t ) in (1) approimately points to in epectation (See more discussion on one point conveity in Appendi). Therefore, y t will converge to with decent probability: Theorem 1 (Main Theorem, Informal). Assume f is smooth, for every D, W () s.t., ma ω W () ω r. Also assume η is bounded by a constant, and Assumption 1 holds with, η, and c. For T 1 Õ( 1 ηc )4, and any T > 0, with probability at least 1/, we have y t O(log(T ) ηr c ) for any t s.t., T 1 + T t T 1. Notice that our main theorem not only says SGD will get close to, but also says with constant probability, SGD will stay close to for the future T steps. As we will see in Section 5, we observe that Assumption 1 holds along the SGD trajectory for the modern neural networks when the noise comes from real data mini-batches. Moreover, the SGD trajectory matches with our theory prediction in practice. Our main theorem can also help eplain why SGD could escape sharp local minima and converge to flat local minima in practice [18]. Indeed, the sharp local minima have small loss value and small diameter, so after convolved with the noise kernel, they easily disappear, which means Assumption 1 holds. However, flat local minima have large diameter, so they still eists after convolution. In that case, our main theorem says, it is more likely that SGD will converge to flat local minima, instead of sharp local minima. 3 Notice that is not necessarily the global optimal in the original function f due to the convolution operator. 4 We use Õ to hide log terms here. 3

4 Original function f Gradient Descent without noise 15.0 Last Iter 1.5 Last Iter Count GD, noise=0.3, init=[-3,3] Last Iter Last Iter Count f convolved with [ 0.15, 0.15] noisy region Gradient Descent with noise= GD, noise=0.15, init=[-1.5,1.5] f convolved with [ 0.3, 0.3] noisy region Gradient Descent with noise= GD, noise=5, init=[-0.5,0.5] f convolved with [ 0.6, 0.6] noisy region Gradient Descent with noise= GD, noise=1, init=[-0.3,0.3] Figure 3: Running SGD on a spiky function f. Row 1: f gets smoother after convolving with uniform random noise. Row : Run SGD with different noise levels. Every figure is obtained with 100 trials with different random initializations. Red dots represent the last iterates of these trials, while blue bars represent the cumulative counts. GD without noise easily gets stuck at various local minima, while SGD with appropriate noise level converges to a local region. Row 3: In order to get closer to, one may run SGD in multiple stages with shrinking learning rates. 1.1 Related Work Previously, people already realized that the noise in the gradient could help SGD to escape saddle points [7, 16] or achieve better generalization [9, 3]. With the help of noise, SGD can also be viewed as doing approimate Bayesian inference [] or variational inference [3]. Besides, it is proved that SGD with etra noise could hit a local minimum with small loss value in polynomial time under some assumptions [9]. However, the etra noise is too big to guarantee convergence, and that model cannot deal with escaping sharp local minima. Escaping sharp local minima for neural network is important, because it is conjectured (although controversial [5]) that flat local minima may lead to better generalization [11, 18, ]. It is also observed that the correct learning rate schedule (small or large) is crucial for escaping bad local minima [15, 1]. Furthermore, solutions that are farther away from the initialization may lead to wider local minima and better generalization [13]. Under a Bayesian perspective, it is shown that the noise in stochastic gradient could drive SGD away from sharp minima, which decides the optimal batch size [7]. There are also eplanations for why small batch methods prefers flat minima while large batch methods are not, by investigating the canonical quadratic sums problem [5]. 4

5 To visualize the loss surface of neural network, a common practice is projecting it onto a one dimensional line [8], which was observed to be conve. For the simple two layer neural network, a local one point strongly conveity property provably holds under Gaussian input assumption [0]. Motivating Eample Let us first see a simple eample in Figure 3. We use F r,c to denote the sub-figure at row r and column c. The function f at F 1,1 is a approimately conve function, but very spiky. Therefore, GD easily gets stuck at various local minima, see F,1. However, we want to get rid of those spurious local minima, and get a point near = 0. If we take the alternative view that SGD works on the convolved version of f (F 1,, F 1,3, F 1,4 ), we find that those functions are much smoother and contain few local minima. However, the gradient noise here is a double-edged sword. On one hand, if the noise is small, the convolved f is still somewhat non-conve, then SGD may find a few bad local minima as shown in F,. On the other hand, if the noise is too large, the noise dominates the gradient, and SGD will act like random walk, see F,4. F,3 seems like a nice tradeoff, as all trials converges to a local region near 0, but the region is too big (most points are in [ 1.5, 1.5]). In order to get closer to 0, we may restart SGD with a point in [ 1.5, 1.5], using smaller noise level Recall in F,, SGD fails because the convolved f has a few non-conve regions (F 1, ), so SGD may find spurious local minima. However, those local minima are outside [ 1.5, 1.5]. The convolved f in F 1, restricted in [ 1.5, 1.5] is pretty conve, so if we start a point in this region, SGD converges to a smaller local region centered at 0, see F 3,. We may do this iteratively, with even smaller noise levels and smaller initialization regions, and finally we will get pretty close to 0 with decent probability, see F 3,3 and F 3,4. 3 Main Theorem Definition 1 (Smoothness). Function f R d R is L-smooth, if for any, y R d, f(y) f() + f (), y + L y Assume that we are running SGD on the sequence { t }. Recall the update rule (1) for y t. Our main theorem says that {y t } is converging to and will stay around afterwards. Theorem 1 (Main Theorem). Assume f is L-smooth, for every D, W () s.t., ma ω W () ω r. For a fied target solution, if there eists constant c, η > 0, such that Assumption 1 holds with, η, c, and η < min{ 1 L, c, 1 L c }, ηc η L, b η r (1 + ηl). Then for any fied T 1 log( y 0 /b) y t O ( log(t )b and ) T > 0, with probability at least 1/, we have y T 0b and for all t s.t., T 1 + T t T 1. We defer the proof to Section 4. Remark. For fied c, there eists a lower bound on η to satisfy Assumption 1, so η cannot be arbitrarily small. However, the main theorem says within T 1 + T steps, SGD will stay in a 5

6 ( ) local region centered at with diameter O log(t )b, which is essentially Õ(ηr /c) that scales with η. In order to get closer to, a common trick in practice is to restart SGD with smaller step size η within the local region. If f inside this region has better geometric properties (which is usually true), one gets better convergence guarantee: Corollary (Shrinking Learning Rate). If the assumptions in Theorem 1 holds, and f restricted in the local region D { 0b } satisfy the same assumption with c > c, η < η, then log( d b ) if we run SGD with η for the first T 1 steps, with probability at least 1/4, we have y T1 +T 0b steps, and with η for the net T log( 0b b ) < 0b. This corollary can be easily generalized to shrink the learning rate multiple times. Our main theorem is based on the important assumption that the step size is bounded. If the step size is too big, even if the whole function f is one point conve (a stronger assumption than Assumption 1), and we run full gradient descent, we may not keep getting closer to, as we show below. Theorem 3. For function f, if, f(), c, and we are at the point t. If we run full gradient descent with step size η > c t, we have f( t) t+1 t. Proof. The proof is straightforward and we defer it to Appendi C. This theorem can be best illustrated with Figure 4. If η is too big, although the gradient (the arrow) is pointing to the approimately correct direction, t+1 will be farther away from (going outside of the -centered ball). Although this theorem analyzes the simple full gradient case, SGD is similar. In the high dimensional case, it is natural to assume that most of the noise will be orthogonal to the direction of t, therefore with additional noise inside the stochastic gradient, a large step size will drive t+1 away from more easily. t t+1 if η too big Figure 4: When step size is too big, even the gradient is one point conve, we may still go farther away from. Therefore, our paper provides a theoretical eplanation for why picking step size is so important (too big or too small will not work). We hope it could lead to more practical guidelines in the future. 4 Proof for Theorem 1 In the proof, we will use the following lemma. Theorem 4 (Azuma). Let X 1, X,, X n be independent random variables satisfying X i E(X i ) c i, for 1 i n. We have the following bound for the sum X = n i=1 X i: Pr( X E(X) ) e n c i=1 i. Our proof has four steps. Step 1. Since Assumption 1 holds, we show that SGD always makes progress towards in epectation, plus some noise. 6

7 Let filtration F t = σ{ω 0,, ω t 1 }, where σ{ } denotes the sigma field. Notice that for any ω t W ( t ), we have E[ω t F t ] = 0. Thus, E[ y t+1 F t ] = E[ y t ηω t η f(y t ηω t ) F t ] ] =E [ y t η f(y t ηω t ) + ηω t ηω t, y t η f(y t ηω t ) F t ] E [ y t η f(y t ηω t ) + η r ηω t, η f(y t ηω t ) + η f(y t ) η f(y t ) F t ] E [ y t + η f(y t ηω t ) η f(y t ηω t ), y t + η r + η 3 r L F t ] y t + E [η f(y t ηω t ) F t + η r η E ωt W ( t)f(y t ηω t ), y t + η 3 r L ] (1 ηc) y t + η r + η 3 r L + E [η L y t + ηω t F t (1 ηc) y t + η r + η 3 r L + η L y t + η 4 r L =(1 ηc + η L ) y t + η r (1 + ηl) Step. Since SGD makes progress in every step, after many steps, SGD gets very close to in epectation. By Markov inequality, this event holds with large probability. Notice that since η < c, we have = ηc η L > ηc > 0. Recall b η r (1 + ηl), we L get: E[ y t+1 F t ] (1 ) y t + b Let G t = (1 ) t ( y t b ), we get: E[G t+1 F t ] G t That means, G t is a supermartingale. We have E[G T1 F T1 1] G 0 Which gives [ E y T1 b F T1 1 ] (1 ) T 1 ( y 0 b ) (1 ) T 1 y 0 That is, Since T 1 ( y log 0 ) b E[ y T1 F T1 1] b + (1 )T 1 y 0, we get: E[ y T1 F T1 1] b By Markov inequality, we know with probability at least 0.9, y T1 0b (3) 7

8 For notational simplicity, for the analysis below we relabel the point y T1 as y 0. Therefore, at time 0 we already have y 0 0b. Step 3. Conditioned on the event that we are close to, below we show that if for t 0 > t 0, y t is close to, then y t0 is also close to with high probability. Let ζ = 9T 4. Let event E t = { τ t, y τ b µ = δ}, where µ is a parameter satisfies µ ma{8, 4 log 1 (ζ)}. If with probability 5 9, E t holds for every t T, we are done. By the previous calculation, we know that (1 Et is the indicator function for E t ) E[G t 1 Et 1 F t 1 ] G t 1 1 Et 1 G t 1 1 Et So G t 1 Et 1 is a supermartingale, with the initial value G 0. In order to apply Azuma inequality, we first bound the following term (notice that we use E[ω t ] = 0 multiple times): G t+1 1 Et E[G t+1 1 Et F t ] =(1 ) t y t ηω t η f(y t ηω t ) E[ y t ηω t η f(y t ηω t ) F t ] 1 Et (1 ) t ηω t, y t η f(y t ηω t ) + ηω t + y t η f(y t ηω t ) E[ ηω t, y t η f(y t ηω t ) + ηω t + y t η f(y t ηω t ) F t ] =(1 ) t ηω t E[ ηω t F t ] ηω t, y t η f(y t ηω t ) + y t η f(y t ηω t ) E[ ηω t, η f(y t ηω t ) + y t η f(y t ηω t ) F t ] (1 ) t η r + ηr y t + ηω t, η f(y t ηω t ) + η f(y t ηω t ) η f(y t ) + η f(y t ) E[ η f(y t ηω t ) η f(y t ) + η f(y t ) F t ] + y t, η f(y t ηω t ) E[η f(y t ηω t ) F t ] E[ ηω t, η f(y t ηω t ) F t ] (1 ) t η r + ηr y t + 4η r f(y t ηω t ) + η (η r L + f(y t ), f(y t ηω t ) f(y t ) E[ f(y t ηω t ) f(y t ) F t ] ) + η y t, f(y t ηω t ) f(y t ) E[ f(y t ηω t ) f(y t ) F t ] =(1 ) t η r + ηr y t + 4η rl(ηr + y t ) + η ( η r L + 4L y t ηrl ) + 4η rl y t (1 ) t ( 3.5η r + 7ηrδ ) Where the last inequality uses the fact that ηl 1 and y t δ (as 1 Et holds). Let M 3.5η r + 7ηrδ. Let d τ = G τ 1 Eτ 1 E[G τ 1 Eτ 1 F t ], we have t d τ = τ=1 t (1 ) τ M τ=1 r t = t d τ = M t (1 ) τ τ=1 Apply Azuma inequality (Theorem 4), for any ζ > 0, we know Pr(G t 1 Et 1 G 0 ( ) r t log 1 r (ζ)) ep t log(ζ) t = ep log(ζ) = 1 τ=1 d τ ζ 8 τ=1

9 Therefore, with probability 1 1 ζ, G t 1 Et 1 G 0 + r t log 1 (ζ) Step 4. The inequality above says, if E t 1 holds, i.e., for all τ t 1, y τ δ, then with probability 1 1 ζ, G t is bounded. If we can show from the upper bound of G t that y t δ is also true, we automatically get E t holds. In other words, that means if E t 1 holds, then E t holds with probability 1 1 ζ. Therefore, by applying this claim T times, we get E T holds with probability 1 T ζ = 5 9. Combining with inequality (3), we know with probability at least 1/, the theorem statement holds. Thus, it remains to show that y t δ. If G t 1 Et 1 G 0 + r t log 1 (ζ), we know ( (1 ) t y t b ) y 0 b + r t log 1 (ζ) So Notice that y t (1 ) t ( y 0 + r t log 1 (ζ) ) + b y 0 + (1 ) t r t log 1 (ζ) + b (1 ) t r t = (1 ) t M t (1 ) τ = M t (1 ) (t τ) τ=1 =M t 1 (1 ) τ 1 M 1 (1 ) M ηc τ=0 τ=1 1 The second last inequality holds because we know = 1 1 (1 ) 1 1 ηc, since = ηc η L ηc < 1, and > ηc. That means, y t y 0 + (3.5η r + 7ηrδ) ηc M ηc log 1 (ζ) + b log 1 (ζ) + 1b It remains to prove the following lemma, which we defer to Appendi B. Lemma 5. (3.5η r + 7ηrδ) ηc log 1 (ζ) + 1b δ Therefore, y t δ. Combining the 4 steps together, we have proved the theorem. 9

10 Inner Product Densenet_Cifar10 Densenet_Cifar100 Resnet_Cifar10 Resnet_Cifar Epochs Inner Product Among Neighborhood Densenet_Cifar10 Densenet_Cifar100 Resnet_Cifar10 Resnet_Cifar Epochs Norm of Stochastic Gradients Densenet_Cifar10 Densenet_Cifar100 Resnet_Cifar10 Resnet_Cifar Epochs (a) SGD trajectory is locally one point conve. (b) The neighborhood of SGD trajectory is one point conve. (c) The norm of stochastic gradient Figure 5: (a). The inner product between the negative gradient and 300 t for each epoch t 5 is always positive. Every data point is the minimum value among 5 trials. (b). Neighborhood of SGD trajectory is also one point conve with respect to 300. (c). Norm of stochastic gradient 5 Empirical Observations In this section, we eplore the loss surfaces of modern neural networks, and show that they enjoy many nice one point conve properties. Therefore, our main theorem could be used for eplaining why SGD works so well in practice. 5.1 The SGD trajectory is one point conve It is well known that the loss surface of neural network is highly non-conve, with numerous local minima. However, we observe that the loss surface is consisted of many one point conve basin region, while each time SGD traverses one of such regions. See Figure 5a for details. We ran eperiments on Resnet [10] (34 layers, 1.M parameters), Densenet [14] (100 layers, 0.8M parameters) on cifar10 and cifar100, each for 5 trials with 300 epochs and different initializations. For the start of every epoch t in each trial, we compute the inner product between the negative gradient f( t ) and the direction 300 t. In Figure 5a, we plot the minimum value for every epoch among 5 trials for each setting. Notice that ecept for the starting period of densenet on Cifar-10, all the other networks in all trials have positive inner products, which shows that the trajectory of SGD (ecept the starting period) is one point conve with respect to the final solution 5. In these eperiments, we have used the standard step size schedule (0.1 initially, 1 after epoch 150, and 01 after epoch 5). However, we got the same observation when using smoothly decreasing step sizes (shrink by 0.99 per epoch). 5. The neighborhood of the trajectory is one point conve Having a one point conve trajectory for 5 trials does not suffice to show SGD always has a simple and easy trajectory, due to the randomness of the stochastic gradient. Indeed, by a slight random perturbation, SGD might be in a completely different trajectory that is far from being one point conve to the final solution. However, in this subsection, we show that it is not the case, as the SGD trajectory is one point conve after convolving with uniform ball with 5 Similar observations were implicitly observed previously [8]. 10

11 Loss value of different local minima Densenet_Cifar10_train_loss Densenet_Cifar10_val_loss Resnet_Cifar10_train_loss Resnet_Cifar10_val_loss The starting epoch of the obtained local minimum Loss value of different local minima Densenet_Cifar100_train_loss Densenet_Cifar100_val_loss Resnet_Cifar100_train_loss Resnet_Cifar100_val_loss The starting epoch of the obtained local minimum Distance to the starting point Densenet_Cifar10 Resnet_Cifar10 Densenet_Cifar100 Resnet_Cifar The starting epoch of the obtained local minimum (a) Loss value of different local minima on Cifar10 (b) Loss value of different local minima on Cifar100 (c) Distance from the local minima to the initialization Figure 6: Spectrum of local minima on the loss surface on modern neural networks. radius 0.5. That means, the whole neighborhood of the SGD trajectory is one point conve with respect to the final solution. In this eperiment, we tried Resnet (34 layers, 1.M parameters), Densenet (100 layers, 0.8M parameters) on cifar10 and cifar For every epoch in each setting, we take one point and look at its neighborhood with radius 0.5 (upper bound of the length of one SGD step, as we will show below). We take 100 random points inside each neighborhood to verify Assumption 1 7. More specifically, for every random point w in the neighborhood of t, we computer f(w), 300 t. Figure 5b shows the mean value (solid line), as well as upper and lower bound of the inner product (shaded area). As we can see, the inner products for all epochs in every setting have small variances, and are always positive. Although we could not verify Assumption 1 by computing the eact epectation due to limited computational resources, from Figure 5b and Hoeffding bound (Lemma 6), we conclude that Assumption 1 should hold with high probability. Lemma 6 (Hoeffding bound [1]). Let X 1,..., X n be i.i.d. random variables bounded by the interval [a, b]. Then Pr ( 1 n i X i E[X 1 ] t ) ( ep ). nt (b a) Figure 5c shows the norm of the stochastic gradients, including both the mean value (solid lines), as well as upper and lower bounds (shaped area). For all settings, the stochastic gradients are always less than 5 before epoch 150 with learning rate 0.1, and less than 15 afterwards with learning rate 1. Therefore, multiplying step size with gradient norm, we know SGD step length is always bounded by 0.5. Notice that the gradient norm gets bigger when we get closer to the final solution (after epoch 150). This further eplains why shrinking step size is important. 5.3 Loss surface is locally a slope Even with the observation that the whole neighborhood along the SGD trajectory is one point conve with respect to the final solution, there eists a chicken-and-egg concern, as the final target is generated using the SGD trajectory. 6 We also tried VGG with 1M parameters, but does not have similar observations. This might be why Resnet and Densenet are slightly easier to optimize. 7 We also tried to sample points that are one SGD step away to represent the neighborhood, and got similar observations. 11

12 In this subsection, we show that the one point conveity is a pretty global property. We were running Resnet and Densenet on Cifar10, but with smaller networks (each with about 10K parameters). For each network, if we fi the first 10 epochs, and generate 50 SGD trajectories with different random seeds for 140 epochs and 0.1 learning rate, we get 50 different final solutions (they are pretty far away from each other, with minimum pairwise distance 40). For each network, if we look at the inner product between the negative gradient of any epoch of any trajectories, and the vector pointing to any final solutions, we find that the inner products are almost always positive. (only 0.1% of the inner products are not positive for Densenet, and only out of 343, 000 inner products are not positive for Resnet). This indicates that the loss surface is skewed to the similar direction, and our observation that the whole SGD trajectory is one point conve w.r. to the last point is not a coincidence. Based on our Theorem 1, such loss surface is very friendly to SGD optimization, even with a few eceptional points that are not one point conve with respect to the final solution. Notice that in general, it is not possible that all the negative gradients of all points are one point conve with respect to multiple target points. For eample, if we take 1D interpolation between any two target points, we could easily find points that have negative gradients only pointing to one target point. However, based on our simulation, empirically SGD almost never traverse those regions. 5.4 Spectrum of the local minima From the previous subsections, we know that the loss surface of neural network has great one point conve properties. It seems that by our Theorem 1, SGD will almost always converge to a few target points (or regions). However, empirically SGD converges to very different target points. In this subsection, we argue that this is because of the learning rate is too big for SGD to converge (Theorem 3). On the other hand, whenever we shrink the learning rate to 1, Theorem 1 immediately applies and SGD converges to a local minimum. In this eperiment, we were running smaller version of Resnet and Densenet (each with about 10K parameters) on Cifar10 and Cifar100. For each setting, we first train the network with step size 0.1 for 300 epochs, then we pick different epochs as the new starting points for finding nearby local minima using smaller learning rates with additional 150 epochs. See Figure 6a and Figure 6b. Starting from different epochs, we got local minima with decreasing validation loss and training loss. To show that these local minima are not from the same region, we also plot the distance of the local minima to the (unique) initialized point. As we can see, as we pick later epochs as the starting points, we get local minima that are farther away from the initialization with better quality (also observed in [13]). Furthermore, we observe that for every local minimum, the whole trajectory is always one point conve to that local minimum. Therefore, the time for shrinking learning rate decides the quality of the final local minimum. That is, using large step size initially avoids being trapped into a bad local minimum, and whenever we are distant enough from the initialization, we can shrink the step size and converge to a good local minimum (due to one point conveity by Theorem 1). 1

13 6 Conclusion In this paper, we take an alternative view of SGD that it is working on the convolved version of the loss function. Under this view, we could show that when the convolved function is one point conve with respect to the final solution, SGD could escape all the other local minima and stay around with constant probability. To show our assumption is reasonable, we look at the loss surface of modern neural networks, and find that SGD trajectory has nice local one point conve properties, therefore the loss surface is very friend to SGD optimization. It remains an interesting open question to prove local one point conve property for deep neural networks. Acknowledgement The authors want to thank Zhishen Huang for pointing out a mistake in an early version of this paper, and want to thank Gao Huang, Kilian Weinberger, Jorge Nocedal, Ruoyu Sun, Dylan Foster and Aleksander Madry for helpful discussions. This project is supported by a Microsoft Azure research award and Amazon AWS research award. References [1] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-conve or sum-ofnon-conve objectives. In ICML 016, volume 48, pages , 016. [] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. ArXiv e-prints, November 016. [3] P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. ArXiv e-prints, October 017. [4] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly conve composite objectives. In NIPS 014, pages , 014. [5] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp Minima Can Generalize For Deep Nets. ArXiv e-prints, March 017. [6] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 1:11 159, 011. [7] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In COLT 015, volume 40, pages , 015. [8] Ian J. Goodfellow and Oriol Vinyals. Qualitatively characterizing neural network optimization problems. CoRR, abs/ ,

14 [9] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. ArXiv e-prints, September 015. [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages , 016. [11] Sepp Hochreiter and Jürgen Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, pages MIT Press, [1] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, [13] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In I. Guyon, U. V. Luburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages Curran Associates, Inc., 017. [14] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely Connected Convolutional Networks. ArXiv e-prints, August 016. [15] Gao Huang, Yiuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get m for free. In ICLR 017, 017. [16] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. CoRR, abs/ , 017. [17] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS 013, pages , 013. [18] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR 017, 017. [19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/ , 014. [0] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In NIPS 017, 017. [1] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. In ICLR 017, 017. [] S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic Gradient Descent as Approimate Bayesian Inference. ArXiv e-prints, April 017. [3] W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization Bounds of SGLD for Non-conve Learning: Two Theoretical Viewpoints. ArXiv e-prints, July 017. [4] Yurii Nesterov. Introductory Lectures on Conve Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition,

15 [5] V. Patel. The Impact of Local Geometry and Batch Size on the Convergence and Divergence of Stochastic Gradient Descent. ArXiv e-prints, September 017. [6] Mark Schmidt, Nicolas Le Rou, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1 30, 016. [7] S. L. Smith and Q. V. Le. A Bayesian Perspective on Generalization and Stochastic Gradient Descent. ArXiv e-prints, October 017. [8] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, pages , 013. [9] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. CoRR, abs/ ,

16 A Discussions on one point conveity If f is δ-one point strongly conve around in a conve domain D, then is the only local minimum point in D (i.e., global minimum). To see this, for any fied D, look at the function g(t) = f(t + (1 t)) for t [0, 1], then g (t) = f(t + (1 t)),. The definition of δ-one point strongly conve implies that the right side is negative for t (0, 1]. Therefore, g(t) > g(1) for t > 0. This implies that for every point y on the line segment joining to, we have f(y) > f( ), so is the only local minimum point. B Proof for Lemma 5 Proof. Recall that we want to show (3.5η r + 7ηrδ) log 1 1b (ζ) + ηc δ = µ b = µ η r (1 + ηl) On the left hand side there are three summands. Below we show that each of them is bounded by µ b 3 8. Since µ ma{8, 4 log 1 (ζ)}, we know 1b 63b 3 < 8 b 3 µ b 3. Net, we have 4 log 1 (ζ) µ 30 log 1 (ζ)η 0.5 c 0.5 µ 15 log 1 (ζ) µ η 0.5 c c log 1 (ζ) µ η η 1.5 r c log 1 (ζ) µ η r η r ηc log 1 (ζ) µ η r (1 + ηl) 3 Finally, 4 log 1 (ζ) µ 4 c log 1 (ζ) µ 1 c 7 ηr ηc log 1 (ζ) µ 8 We made no effort to optimize the constants here. η r (1+ηL) ηc 3 16

17 7 ηr log 1 δ (ζ) ηc 3 7 ηrδ ηc log 1 (ζ) δ 3 Adding the three summands together, we get the claim. C Proof for Theorem 3 Proof. Recall that we have t+1 = t η f( t ). Since we have f( t ), t c t, then t+1 = t η f( t ) = t + η f( t ) η f( t ), t (1 ηc ) t + η f( t ) > t Where the last inequality holds since we know η > c t. f( t) 17

Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima J. Nocedal with N. Keskar Northwestern University D. Mudigere INTEL P. Tang INTEL M. Smelyanskiy INTEL 1 Initial Remarks SGD

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes Daniel M. Roy University of Toronto; Vector Institute Joint work with Gintarė K. Džiugaitė University

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Mini-Course 1: SGD Escapes Saddle Points

Mini-Course 1: SGD Escapes Saddle Points Mini-Course 1: SGD Escapes Saddle Points Yang Yuan Computer Science Department Cornell University Gradient Descent (GD) Task: min x f (x) GD does iterative updates x t+1 = x t η t f (x t ) Gradient Descent

More information

A picture of the energy landscape of! deep neural networks

A picture of the energy landscape of! deep neural networks A picture of the energy landscape of! deep neural networks Pratik Chaudhari December 15, 2017 UCLA VISION LAB 1 Dy (x; w) = (w p (w p 1 (... (w 1 x))...)) w = argmin w (x,y ) 2 D kx 1 {y =i } log Dy i

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

FreezeOut: Accelerate Training by Progressively Freezing Layers

FreezeOut: Accelerate Training by Progressively Freezing Layers FreezeOut: Accelerate Training by Progressively Freezing Layers Andrew Brock, Theodore Lim, & J.M. Ritchie School of Engineering and Physical Sciences Heriot-Watt University Edinburgh, UK {ajb5, t.lim,

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

FIXING WEIGHT DECAY REGULARIZATION IN ADAM

FIXING WEIGHT DECAY REGULARIZATION IN ADAM FIXING WEIGHT DECAY REGULARIZATION IN ADAM Anonymous authors Paper under double-blind review ABSTRACT We note that common implementations of adaptive gradient algorithms, such as Adam, limit the potential

More information

arxiv: v2 [stat.ml] 18 Jun 2017

arxiv: v2 [stat.ml] 18 Jun 2017 FREEZEOUT: ACCELERATE TRAINING BY PROGRES- SIVELY FREEZING LAYERS Andrew Brock, Theodore Lim, & J.M. Ritchie School of Engineering and Physical Sciences Heriot-Watt University Edinburgh, UK {ajb5, t.lim,

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

arxiv: v1 [cs.lg] 7 Jan 2019

arxiv: v1 [cs.lg] 7 Jan 2019 Generalization in Deep Networks: The Role of Distance from Initialization arxiv:1901672v1 [cs.lg] 7 Jan 2019 Vaishnavh Nagarajan Computer Science Department Carnegie-Mellon University Pittsburgh, PA 15213

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Nostalgic Adam: Weighing more of the past gradients when designing the adaptive learning rate

Nostalgic Adam: Weighing more of the past gradients when designing the adaptive learning rate Nostalgic Adam: Weighing more of the past gradients when designing the adaptive learning rate Haiwen Huang School of Mathematical Sciences Peking University, Beijing, 100871 smshhw@pku.edu.cn Chang Wang

More information

arxiv: v1 [cs.lg] 16 Jun 2017

arxiv: v1 [cs.lg] 16 Jun 2017 L 2 Regularization versus Batch and Weight Normalization arxiv:1706.05350v1 [cs.lg] 16 Jun 2017 Twan van Laarhoven Institute for Computer Science Radboud University Postbus 9010, 6500GL Nijmegen, The Netherlands

More information

LARGE BATCH TRAINING OF CONVOLUTIONAL NET-

LARGE BATCH TRAINING OF CONVOLUTIONAL NET- LARGE BATCH TRAINING OF CONVOLUTIONAL NET- WORKS WITH LAYER-WISE ADAPTIVE RATE SCALING Anonymous authors Paper under double-blind review ABSTRACT A common way to speed up training of deep convolutional

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

SGD CONVERGES TO GLOBAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH

SGD CONVERGES TO GLOBAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH Under review as a conference paper at ICLR 9 SGD CONVERGES TO GLOAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH Anonymous authors Paper under double-blind review ASTRACT Stochastic gradient descent (SGD)

More information

arxiv: v2 [cs.lg] 23 May 2017

arxiv: v2 [cs.lg] 23 May 2017 Shake-Shake regularization Xavier Gastaldi xgastaldi.mba2011@london.edu arxiv:1705.07485v2 [cs.lg] 23 May 2017 Abstract The method introduced in this paper aims at helping deep learning practitioners faced

More information

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation

More information

Don t Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017

Don t Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017 Don t Decay the Learning Rate, Increase the Batch Size Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017 slsmith@ Google Brain Three related questions: What properties control generalization?

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Overparametrization for Landscape Design in Non-convex Optimization

Overparametrization for Landscape Design in Non-convex Optimization Overparametrization for Landscape Design in Non-convex Optimization Jason D. Lee University of Southern California September 19, 2018 The State of Non-Convex Optimization Practical observation: Empirically,

More information

Stochastic Optimization Methods for Machine Learning. Jorge Nocedal

Stochastic Optimization Methods for Machine Learning. Jorge Nocedal Stochastic Optimization Methods for Machine Learning Jorge Nocedal Northwestern University SIAM CSE, March 2017 1 Collaborators Richard Byrd R. Bollagragada N. Keskar University of Colorado Northwestern

More information

UNDERSTANDING LOCAL MINIMA IN NEURAL NET-

UNDERSTANDING LOCAL MINIMA IN NEURAL NET- UNDERSTANDING LOCAL MINIMA IN NEURAL NET- WORKS BY LOSS SURFACE DECOMPOSITION Anonymous authors Paper under double-blind review ABSTRACT To provide principled ways of designing proper Deep Neural Network

More information

Theory of Deep Learning IIb: Optimization Properties of SGD

Theory of Deep Learning IIb: Optimization Properties of SGD CBMM Memo No. 72 December 27, 217 Theory of Deep Learning IIb: Optimization Properties of SGD by Chiyuan Zhang 1 Qianli Liao 1 Alexander Rakhlin 2 Brando Miranda 1 Noah Golowich 1 Tomaso Poggio 1 1 Center

More information

Local Affine Approximators for Improving Knowledge Transfer

Local Affine Approximators for Improving Knowledge Transfer Local Affine Approximators for Improving Knowledge Transfer Suraj Srinivas & François Fleuret Idiap Research Institute and EPFL {suraj.srinivas, francois.fleuret}@idiap.ch Abstract The Jacobian of a neural

More information

Normalization Techniques in Training of Deep Neural Networks

Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal Sub-Sampled Newton Methods for Machine Learning Jorge Nocedal Northwestern University Goldman Lecture, Sept 2016 1 Collaborators Raghu Bollapragada Northwestern University Richard Byrd University of Colorado

More information

On the fast convergence of random perturbations of the gradient flow.

On the fast convergence of random perturbations of the gradient flow. On the fast convergence of random perturbations of the gradient flow. Wenqing Hu. 1 (Joint work with Chris Junchi Li 2.) 1. Department of Mathematics and Statistics, Missouri S&T. 2. Department of Operations

More information

Deep Learning II: Momentum & Adaptive Step Size

Deep Learning II: Momentum & Adaptive Step Size Deep Learning II: Momentum & Adaptive Step Size CS 760: Machine Learning Spring 2018 Mark Craven and David Page www.biostat.wisc.edu/~craven/cs760 1 Goals for the Lecture You should understand the following

More information

Lecture 6 Optimization for Deep Neural Networks

Lecture 6 Optimization for Deep Neural Networks Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks

More information

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic

More information

Unraveling the mysteries of stochastic gradient descent on deep neural networks

Unraveling the mysteries of stochastic gradient descent on deep neural networks Unraveling the mysteries of stochastic gradient descent on deep neural networks Pratik Chaudhari UCLA VISION LAB 1 The question measures disagreement of predictions with ground truth Cat Dog... x = argmin

More information

arxiv: v2 [stat.ml] 1 Jan 2018

arxiv: v2 [stat.ml] 1 Jan 2018 Train longer, generalize better: closing the generalization gap in large batch training of neural networks arxiv:1705.08741v2 [stat.ml] 1 Jan 2018 Elad Hoffer, Itay Hubara, Daniel Soudry Technion - Israel

More information

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization

More information

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley Collaborators Joint work with Samy Bengio, Moritz Hardt, Michael Jordan, Jason Lee, Max Simchowitz,

More information

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive

More information

On the Local Minima of the Empirical Risk

On the Local Minima of the Empirical Risk On the Local Minima of the Empirical Risk Chi Jin University of California, Berkeley chijin@cs.berkeley.edu Rong Ge Duke University rongge@cs.duke.edu Lydia T. Liu University of California, Berkeley lydiatliu@cs.berkeley.edu

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

A Conservation Law Method in Optimization

A Conservation Law Method in Optimization A Conservation Law Method in Optimization Bin Shi Florida International University Tao Li Florida International University Sundaraja S. Iyengar Florida International University Abstract bshi1@cs.fiu.edu

More information

arxiv: v2 [cs.lg] 14 Feb 2018

arxiv: v2 [cs.lg] 14 Feb 2018 Ilya Loshchilov 1 Frank Hutter 1 arxiv:1711.05101v2 [cs.lg] 14 Feb 2018 Abstract L 2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled

More information

Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on

Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on Overview of gradient descent optimization algorithms HYUNG IL KOO Based on http://sebastianruder.com/optimizing-gradient-descent/ Problem Statement Machine Learning Optimization Problem Training samples:

More information

A random perturbation approach to some stochastic approximation algorithms in optimization.

A random perturbation approach to some stochastic approximation algorithms in optimization. A random perturbation approach to some stochastic approximation algorithms in optimization. Wenqing Hu. 1 (Presentation based on joint works with Chris Junchi Li 2, Weijie Su 3, Haoyi Xiong 4.) 1. Department

More information

Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method

Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yingbin Liang Department of EECS Syracuse

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

ON THE INEFFECTIVENESS OF VARIANCE REDUCED OPTIMIZATION FOR DEEP LEARNING

ON THE INEFFECTIVENESS OF VARIANCE REDUCED OPTIMIZATION FOR DEEP LEARNING ON THE INEFFECTIVENESS OF VARIANCE REDUCED OPTIMIZATION FOR DEEP LEARNING Anonymous authors Paper under double-blind review ABSTRACT The application of stochastic variance reduction to optimization has

More information

Foundations of Deep Learning: SGD, Overparametrization, and Generalization

Foundations of Deep Learning: SGD, Overparametrization, and Generalization Foundations of Deep Learning: SGD, Overparametrization, and Generalization Jason D. Lee University of Southern California November 13, 2018 Deep Learning Single Neuron x σ( w, x ) ReLU: σ(z) = [z] + Figure:

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

Information theoretic perspectives on learning algorithms

Information theoretic perspectives on learning algorithms Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez

More information

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate 58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu

More information

Adam: A Method for Stochastic Optimization

Adam: A Method for Stochastic Optimization Adam: A Method for Stochastic Optimization Diederik P. Kingma, Jimmy Ba Presented by Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Adaptive Gradient Methods AdaGrad / Adam

Adaptive Gradient Methods AdaGrad / Adam Case Study 1: Estimating Click Probabilities Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 The Problem with GD (and SGD)

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

OPTIMIZATION METHODS IN DEEP LEARNING

OPTIMIZATION METHODS IN DEEP LEARNING Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Third-order Smoothness elps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Yaodong Yu and Pan Xu and Quanquan Gu arxiv:171.06585v1 [math.oc] 18 Dec 017 Abstract We propose stochastic

More information

Tips for Deep Learning

Tips for Deep Learning Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results

More information

CSC321 Lecture 7: Optimization

CSC321 Lecture 7: Optimization CSC321 Lecture 7: Optimization Roger Grosse Roger Grosse CSC321 Lecture 7: Optimization 1 / 25 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

CSC321 Lecture 8: Optimization

CSC321 Lecture 8: Optimization CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

THERE ARE MANY CONSISTENT EXPLANATIONS OF UNLABELED DATA: WHY YOU SHOULD AVERAGE

THERE ARE MANY CONSISTENT EXPLANATIONS OF UNLABELED DATA: WHY YOU SHOULD AVERAGE THERE ARE MANY CONSISTENT EXPLANATIONS OF UNLABELED DATA: WHY YOU SHOULD AVERAGE Anonymous authors Paper under double-blind review Abstract Presently the most successful approaches to semi-supervised learning

More information

A BAYESIAN PERSPECTIVE ON GENERALIZATION AND STOCHASTIC GRADIENT DESCENT

A BAYESIAN PERSPECTIVE ON GENERALIZATION AND STOCHASTIC GRADIENT DESCENT A BAYESIAN PERSPECTIVE ON GENERALIZATION AND STOCHASTIC GRADIENT DESCENT Samuel L. Smith & Quoc V. Le Google Brain {slsmith, qvl}@google.com ABSTRACT We consider two questions at the heart of machine learning;

More information

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer

More information

Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training

Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network raining Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell School of Computer Science, Carnegie Mellon University

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Sparse Regularized Deep Neural Networks For Efficient Embedded Learning

Sparse Regularized Deep Neural Networks For Efficient Embedded Learning Sparse Regularized Deep Neural Networks For Efficient Embedded Learning Anonymous authors Paper under double-blind review Abstract Deep learning is becoming more widespread in its application due to its

More information

Are ResNets Provably Better than Linear Predictors?

Are ResNets Provably Better than Linear Predictors? Are ResNets Provably Better than Linear Predictors? Ohad Shamir Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot, Israel ohad.shamir@weizmann.ac.il Abstract

More information

Hamiltonian Descent Methods

Hamiltonian Descent Methods Hamiltonian Descent Methods Chris J. Maddison 1,2 with Daniel Paulin 1, Yee Whye Teh 1,2, Brendan O Donoghue 2, Arnaud Doucet 1 Department of Statistics University of Oxford 1 DeepMind London, UK 2 The

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

Negative Momentum for Improved Game Dynamics

Negative Momentum for Improved Game Dynamics Negative Momentum for Improved Game Dynamics Gauthier Gidel Reyhane Askari Hemmat Mohammad Pezeshki Gabriel Huang Rémi Lepriol Simon Lacoste-Julien Ioannis Mitliagkas Mila & DIRO, Université de Montréal

More information

The K-FAC method for neural network optimization

The K-FAC method for neural network optimization The K-FAC method for neural network optimization James Martens Thanks to my various collaborators on K-FAC research and engineering: Roger Grosse, Jimmy Ba, Vikram Tankasali, Matthew Johnson, Daniel Duckworth,

More information

arxiv: v1 [cs.lg] 10 Jan 2019

arxiv: v1 [cs.lg] 10 Jan 2019 Quantized Epoch-SGD for Communication-Efficient Distributed Learning arxiv:1901.0300v1 [cs.lg] 10 Jan 2019 Shen-Yi Zhao Hao Gao Wu-Jun Li Department of Computer Science and Technology Nanjing University,

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Normalization Techniques

Normalization Techniques Normalization Techniques Devansh Arpit Normalization Techniques 1 / 39 Table of Contents 1 Introduction 2 Motivation 3 Batch Normalization 4 Normalization Propagation 5 Weight Normalization 6 Layer Normalization

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH

SHAKE-SHAKE REGULARIZATION OF 3-BRANCH SHAKE-SHAKE REGULARIZATION OF 3-BRANCH RESIDUAL NETWORKS Xavier Gastaldi xgastaldi.mba2011@london.edu ABSTRACT The method introduced in this paper aims at helping computer vision practitioners faced with

More information

Stochastic Gradient Descent. CS 584: Big Data Analytics

Stochastic Gradient Descent. CS 584: Big Data Analytics Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration

More information

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services

More information

IMPROVING STOCHASTIC GRADIENT DESCENT

IMPROVING STOCHASTIC GRADIENT DESCENT IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu

More information