Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Size: px

Start display at page:

Download "Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017"

Lizbeth Higgins
5 years ago
Views:

1 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017

2 Step Sizes and Convergence

3 Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient descent Because we don t have to process the entire training set But converges to a noise ball lim T!1 E kx T x k 2 apple M 2µ µ 2

4 Controlling the accuracy Want the noise ball to be as small as possible for accurate solutions Noise ball proportional to the step size/learning rate lim T!1 E kx T x k 2 apple M 2µ µ 2 = O( ) So should we make the step size as small as possible?

5 Effect of step size on convergence E Let s go back to the convergence rate proof for SGD From the previous lecture, we have E hkx t+1 x k 2 x t i If we re far from the noise ball i.e. hkx t+1 x k 2 x t i apple (1 µ) 2 kx t x k M. kx t x k 2 2 M µ apple (1 µ) 2 kx t x k 2 + µ 2 kx t x k 2.

6 Effect of step size on convergence (continued) E hkx t+1 x k 2 x t i apple (1 µ) 2 kx t x k 2 + µ 2 kx t x k 2 µ apple 1 kx t x k 2 (if µ <1) 2 µ apple exp kx t x k 2. 2 So to contract by a factor of C, we need to run T steps, where 1 = exp µt 2 C, T = 2 µ log C

7 The Full Effect of Step Size Noise ball proportional to the step size lim E kx T x k 2 apple M T!1 2µ µ 2 = O( ) Convergence time inversely proportional to the step size So there s a trade-off! T = 2 µ log C

8 Demo

9 Can we get the best of both worlds? When do we want the step size to be large? At the beginning of of execution? execution! Near the end? Both? When do we want the step size to be small? At the beginning of execution? Near the end? end! Both? What about using a decreasing step size scheme?

10 SGD with Varying Step Size Allow the step size to vary over time x t+1 = x t t rf(x t ; y it ) Turns out this is the standard in basically all machine learning! Two ways to do it: Chosen a priori step sizes step size doesn t depend on measurements Adaptive step sizes choose step size based on measurements & heuristics

11 Optimal Step Sizes for Convex Objectives E Can we use math to choose a step size? Start with our previous bound h kx t+1 x k 2i h apple (1 t µ) 2 E kx t x k 2i + t 2 M h apple (1 t µ)e kx t x k 2i + t 2 M (for t µ<1) Right side is minimized when 0= µe h kx t x k 2i +2 t M, t = µ 2M E h kx t x k 2i

12 Optimal Step Sizes (continued) Let t = E hkx t x k 2i t+1 apple (1 t µ) t + t 2 M µ = 1 2M t µ t + µ 2 = t 2M 2 t + µ2 4M 2 t µ 2 = t 4M 2 t µ 2 2M t M

13 Optimal Step Sizes (continued) Let t = E h kx t x k 2i 1 t+1 (since (1 z) 1 1+z) t µ 2 4M 2 t µ 2 1 = 1 1 t 4M t 1 1+ µ2 4M t t 1 = 1 t + µ2 4M.

14 Optimal Step Sizes (continued) Let t = E hkx t x k 2i µ2 T T 0 4M ) T apple 4M 0 4M + µ 2 0 T Sometimes called a 1/T rate. Slower than the linear rate of gradient descent.

15 Optimal Step Sizes (continued) Substitute back in to find how to set the step size: t = µ 2M 4M 0 4M + µ 2 0 t = 2µ 0 4M + µ 2 0 t = 1 (t) This is a pretty common simple scheme General form is t = 0 1+ t

16 Demo

17 Have we solved step sizes for SGD forever? No. We don t usually know what µ, L, M, and 0 Even if the problem is convex are This optimal rate optimizes the upper bound on the expected distance-squared to the optimum But sometimes this bound is loose, and other step size schemes might do better

18 What if we don t know the parameters? One idea: still use a step size scheme of the form Choose parameters t = 0 1+ t via some other method 0 and t For example, hand-tuning which doesn t scale Can we do this automatically? Yes! This is an example of metaparameter optimization

19 Other Techniques Decrease the step size in epochs Still asymptotically t = 1 (t) but step size decreases in discrete steps Useful for parallelization and saves a little compute Per-parameter learning rates e.g. AdaGrad Replace scalar with diagonal matrix A. Helps with poorly scaled problems x t+1 = x t Arf(x t ; y it )

20 Mini-Batching

21 Gradient Descent vs. SGD Gradient descent: all examples at once x t+1 = x t t 1 N Stochastic gradient descent: one example at a time Is it really all or nothing? Can we do something intermediate? NX i=1 rf(x t ; y i ) x t+1 = x t t rf(x t ; y it )

22 Mini-Batch Stochastic Gradient Descent An intermediate approach x t+1 = x t t 1 B t X where B t is sampled uniformly from the set of all subsets of {1,, N} of size b. The b parameter is the batch size Typically choose b << N. Also called mini-batch gradient descent i2b t rf(x t ; y i )

23 Advantages of Mini-Batch Takes less time to compute each update than gradient descent Only needs to sum up b gradients, rather than N x t+1 = x t t 1 B t X i2b t rf(x t ; y i ) But takes more time for each update than SGD So what s the benefit? It s more like gradient descent, so maybe it converges faster than SGD?

24 Mini-Batch SGD Converges Start by breaking up the update rule into expected update and noise x t+1 x = x t x t (rh(x t ) rh(x )) 1 X t (rf(x t ; y i ) rh(x t )) B t i2b t Variance analysis Var (x t+1 x )=Var t 1 B t X i2b t (rf(x t ; y i ) rh(x t ))!

25 Mini-Batch SGD Converges (continued) Let i = rf(x t ; y i ) rh(x t ), and i = Var (x t+1 x )= 2 t B t 2 Var X = 2 t B t 2 Var = 2 t B t 2 NX i=1 NX j=1 (rf(x t ; y i ) rh(x t )) i2b t! NX i=1 i i i j i j ( 1 i 2 B t 0 i/2 B t!

26 Mini-Batch SGD Converges (continued) Because we sampled B uniformly at random, for i j E [ i j] =P (i 2 B ^ j 2 B) =P (i 2 B) P (j 2 B i 2 B) = b N E 2 b i = P (i 2 B) = N So we can write the variance as 0 Var (x t+1 x )= 2 t B t X i6=j b(b 1) N(N 1) i j + NX i=1 b N 2 i b 1 N 1 1 A

27 Mini-Batch SGD Converges (continued) Var (x t+1 x )= 2 t b 2 = 2 t bn = 2 t bn X i6=j b 1 N 1 b 1 N 1 b(b 1) N(N 1) NX i=1 NX j=1 NX i=1 i i j + i j + NX i=1 NX i=1 b N! 2 + N b N i NX i=1 1 A b 1 N i A 2 i 1 A = 2 t (N b) bn(n 1) NX i=1 2 i

28 Mini-Batch SGD Converges (continued) Var (x t+1 x )= 2 t (N b) b(n 1) 1 N apple 2 t (N b) b(n 1) M apple 2 M t b Compared with SGD, variance decreased by a factor of b NX i=1 = 2 t (N b) h b(n 1) E krf(x t ; y it ) 2 i rh(x t )k 2 x t i

29 Mini-Batch SGD Converges (continued) Recall that SGD converged to a noise ball of size lim E kx T x k 2 apple M T!1 2µ µ 2 Since mini-batching decreases variance by a factor of b, it will have h lim E kx T T!1 Noise ball smaller by the same factor! x k 2i apple M (2µ µ 2 )b

30 Advantages of Mini-Batch (reprise) Takes less time to compute each update than gradient descent Only needs to sum up b gradients, rather than N x t+1 = x t t 1 B t X i2b t rf(x t ; y i ) Converges to a smaller noise ball than stochastic gradient descent h lim E kx T T!1 x k 2i apple M (2µ µ 2 )b

31 How to choose the batch size? Mini-batching is not a free win Naively, compared with SGD, it takes b times as much effort to get a b-times-asaccurate answer But we could have gotten a b-times-as-accurate answer by just running SGD for b times as many steps with a step size of /b. But it still makes sense to run it for systems and statistical reasons Mini-batching exposes more parallelism Mini-batching lets us estimate statistics about the full gradient more accurately Another use case for metaparameter optimization

32 Mini-Batch SGD is very widely used Including in basically all neural network training b = 32 is a typical default value for batch size From Practical Recommendations for Gradient-Based Training of Deep Architectures, Bengio 2012.

33 Overfitting, Generalization Error, and Regularization

34 Minimizing Training Loss is Not our Real Goal Training loss looks like h(x) = 1 N What we actually want to minimize is expected loss on new examples Drawn from some real-world distribution ɸ NX i=1 f(x; y i ) h(x) =E y [f(x; y)] Typically, assume the training examples were drawn from this distribution

35 Overfitting Minimizing the training loss doesn't generally minimize the expected loss on new examples They are two different objective functions after all Difference between the empirical loss on the training set and the expected loss on new examples is called the generalization error Even a model that has high accuracy on the training set can have terrible performance on new examples Phenomenon is called overfitting

36 Demo

37 How to address overfitting Many, many techniques to deal with overfitting Have varying computational costs But this is a systems course so what can we do with little or no extra computational cost? Notice from the demo that some loss functions do better than others Can we modify our loss function to prevent overfitting?

38 Regularization Add an extra regularization term to the objective function Most popular type: L2 regularization h(x) = 1 N NX f(x; y i )+ 2 kxk 2 2 = 1 N i=1 Also popular: L1 regularization h(x) = 1 NX f(x; y i )+ kxk N 1 = 1 N i=1 NX i=1 NX i=1 f(x; y i )+ 2 dx k=1 x 2 i dx f(x; y i )+ x i k=1

39 Benefits of Regularization Cheap to compute For SGD and L2 regularization, there s just an extra scaling x t+1 =(1 2 t 2 )x t t rf(x t ; y it ) Makes the objective strongly convex This makes it easier to get and prove bounds on convergence Helps with overfitting

40 Motivation One way to think about regularization is as a Bayesian prior MLE interpretation of learning problem: P (y i x) = 1 Z exp( f(x; y i)) P (x y) = P (y x) P (x) P (y) Taking the logarithm: log P (x y) = log(z) = P (y x) P (x) P (y) NX i=1 NY i=1 1 Z exp( f(x; y i ) + log P (x) f(x; y i)) log P (y)

41 Motivation (continued) So the MLE problem becomes max x log P (x y) = N X i=1 Now we need to pick a prior probability distribution for x Say we choose a Gaussian prior max x log P (x y) = N X = i=1 f(x; y i ) + log NX f(x; y i ) i=1 f(x; y i ) + log P (x) + (constants) 2 1 p 2 exp!! 2 kxk 2 +(C) 2 2 kxk2 +(C) There s our regularization term!

42 Motivation (continued) By setting a prior, we limit the values we think the model can have This prevents the model from doing bad things to try to fit the data Like the polynomial-fitting example from the demo

43 Demo

44 How to choose the regularization parameter? One way is to use an independent validation set to estimate the test error, and set the regularization parameter manually so that it is high enough to avoid overfitting This is what we saw in the demo But doing this naively can be computationally expensive Need to re-run learning algorithm many times Yet another use case for metaparameter optimization

45 Early Stopping

46 Asymptotically large training sets Setting 1: we have a distribution ɸ and we sample a very large (asymptotically infinite) number of points from it, then run stochastic gradient descent on that training set for only N iterations. Can our algorithm in this setting overfit? No, because its training set is asymptotically equal to the true distribution. Can we compute this efficiently? No, because its training set is asymptotically infinitely large

47 Consider a second setting Setting 1: we have a distribution ɸ and we sample a very large (asymptotically infinite) number of points from it, then run stochastic gradient descent on that training set for only N iterations. Setting 2: we have a distribution ɸ and we sample N points from it, then run stochastic gradient descent using each of these points exactly once. What is the difference between the output of SGD in these two settings? Asymptotically, there s no difference! So SGD in Setting 2 will also never overfit

48 Early Stopping Motivation: if we only use each training example once for SGD, then we can t overfit. So if we only use each example a few times, we probably won t overfit too much. Early stopping: just stop running SGD before it converges.

49 Benefits of Early Stopping Cheap to compute Literally just does less work It seems like the technique was designed to make systems run faster Helps with overfitting

50 How Early to Stop You ll see this in detail in next Wednesday s paper. Yet another application of metaparameter tuning.

51 Questions? Upcoming things Labor day next Monday no lecture Paper Presentation #1 on Wednesday read paper before class

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017

Stochastic Gradient Descent: The Workhorse of Machine Learning CS6787 Lecture 1 Fall 2017 Fundamentals of Machine Learning? Machine Learning in Practice this course What s missing in the basic stuff? Efficiency!