More Optimization. Optimization Methods. Methods

Size: px

Start display at page:

Download "More Optimization. Optimization Methods. Methods"

Sabina Rose
6 years ago
Views:

1 More More Optimization Optimization Methods Methods Yann YannLeCun LeCun Courant CourantInstitute Institute

2 (almost) (almost) everything everything you've you've learned learned about about optimization optimization is... is... (almost) irrelevant to machine learning particularly to large-scale machine learning. OK, that's an exaggeration (with a kernel of truth). Much of optimization theory is about asymptotic convergence speed How many additional significant digits of the solution do you get at each iteration when you approach a solution. Steepest descent algorithms (1 st order) are linear : constant number of additional significant digits per iteration Newton-type algorithms (2 nd order) are super-linear. But in machine learning, we really don't care about optimization accuracy The loss we are minimizing is not the loss we actually want to minimize The loss we are minimizing is measured on the training set, but we are interested in the performance on the test (or validation) set. If we optimize too well, we overfit! The asymptotic convergence speed is irrelevant

3 The The Wall Wall (not (not by by Pink Pink Floyd, Floyd, but but by by Léon Léon Bottou) Bottou) SGD is very fast at first, and very slow as we approach a minimum. 2 nd order methods are (often) slow at first, and fast near a minimum. But the loss on the test set saturates long before the crossover The Wall

4 The The Wall Wall (by (by Léon Léon Bottou, Bottou, not not Pink Pink Floyd) Floyd) SGD is very fast at first, and very slow as we approach a minimum. 2 nd order methods are (often) slow at first, and fast near a minimum. But the loss on the test set saturates long before the crossover The Wall SGD loses The Floor SGD wins

5 Why Why is is SGD SGD so so fast? fast? Why Why is is SGD SGD so so slow? slow? SGD is fast on large datasets because it exploits the redundancy in the data I give you 1,000,000 samples, produced by replicating 10,000 samples 100 times. After 1 epoch, SGD will have done 100 iterations on the 10,000 sample dataset Any batch method will waste 100 times too much computation for each single iteration. SGD is slow near a solution because of gradient noise Example: least square regression Average loss P L (W )= 1 P y i W ' X i 2 i=1 Sample loss Samples have different minima Gradient pulls towards each one Learning rate must decrease Very slow near the minimum But we don't care!

6 Example Example Simple 2-class classification. Classes are drawn from Gaussians distributions Class1: mean=[-0.4, -0.8] Class2: mean=[0.4,0.8] Unit covariance matrix 100 samples Least square regression with y in {-1,+1} L (W )= 1 P i=1 P 1 2 yi W ' X i 2

7 Example Example Batch gradient descent Learning rate = 1.5 Divergence for 2.38

8 Example Example Batch gradient descent Learning rate = 2.5 Divergence for 2.38

9 Example Example Batch gradient descent Learning rate = 0.2 Equivalent to batch learning rate of 20

10 The The convergence convergence of of gradient gradient descent: descent: 1 dimension dimension Update rule: L (W ) L (W ) W t +1 =W t η t L(W t ) W What is the optimal learning rate? L (W ) L (W )

11 The The convergence convergence of of gradient gradient descent descent Taylor expansion of the loss around Wt: L (W ) 1 2 [W W t]' H (W t )[W W t ]+G(W t )[W W t ]+L 0 G: gradient (1xN), H: hessian (NxN) G(W t )= L(W t) W ; H (W t )= 2 L(W t ) W W ' W t W t +1

12 The Theconvergence convergenceof ofgradient gradientdescent descent Taylor: 1 L (W ) [W W t ]' H (W t )[W W t ]+ G(W t )[W W t ]+ L 0 2 At the solution: L (W ) ) L (W =0 W ) L (W W t ] G (W t )=0 =H (W t )[ W W 1 W =W t H (W t ) G(W t ) G( W ) =W t ηopt G (W t ) W G( W t ) slope : H (W t ) 1 ηopt = H (W t ) Wt W

13 Newton's Newton's Algorithm Algorithm At each iteration, solve for W{t+1}: H (W t )[W t +1 W t ] G (W t )=0 Newton's algorithm is impractical for most machine learning tasks It's batch! Each iteration is O(N^3) It only works if the objective is convex and near quadratic If the function is non convex, it can actually go uphill

14 Newton's Newton's Algorithm Algorithm as as a warping warping of of the the space space Newton's algorithm is like GD in a warped space Precomputed warp = preconditioning H =Θ' Λ Λ 2 Θ H 1 =Θ Λ 12 Λ 12 Θ ' W =Θ Λ 12 U 1 L U =Λ 2 Θ' L W δ U = H 1 2 L W δ W = H 1 L W

15 Convergence Convergence of of Gradient Gradient Descent Descent Taylor expansion around a minimum: L (W ) 1 2 [W W ]' H ( W )[W W ]+ L 0 W t +1 =W t η L (W t) W =W t η H (W t W ) W t +1 W =( I η H )(W t W ) This needs to be a contracting operator All the eigenvalues of (I-eta H) must be between -1 and 1

transform the space (change of variable): V t =Θ(W t W ) ; (W t W )=Θ' V t W t

16 Convergence Convergence of of Gradient Gradient Descent Descent Lets' diagonalise H: Θ ' Θ=I H =Θ' ΛΘ ; Λ=Θ H Θ' Θ is a rotation, Λ is diagonal transform the space (change of variable): V t =Θ(W t W ) ; (W t W )=Θ' V t W t +1 W =( I η H )(W t W ) Θ ' V t +1 =(I η H )Θ ' V t V t +1 =Θ( I η H )Θ' V t V t +1 =( I ηλ)v t

17 Convergence Convergence of of Gradient Gradient Descent Descent V t +1 =( I ηλ)v t This is N independent update equations of the form: V t +1, j =(1 η λ j )V t, j These terms are all between -1 and 1 when: η< 2 λ max

18 Example Example Simple 2-class classification. Classes are drawn from Gaussians distributions Class1: mean=[-0.4, -0.8] Class2: mean=[0.4,0.8] Unit covariance matrix 100 samples Least square regression with y in {-1,+1} L (W )= 1 P i=1 P 1 2 yi W ' X i 2 P L (W )= 1 P i=1 [ 1 2 W ' X i X i ' W y i X i ' W +( y i ) 2 ] L (W )=W ' [ 1 2P i =1 P X i X i ' ]W [ 1 P 2P i =1 y i X i ' ]W +[ 1 2P ( y i ) 2 ] i=1 P

19 Problem Problem Batch gradient descent P H = 1 2P i =1 X i X i ' Eigenvalues are 0.83 and The largest eigenvalue limits the maximum learning rate (2.36) But for this learning rate, convergence is slow in the other dimensions Convergence time is proportional to: Convergence time for 1 ηλ min η= 1 λ max : τ = λ max 2 λ min Convergence time is proportional to condition number

20 Making Making the the Hessian Hessian better better conditioned conditioned The Hessian is the covariance matrix of the inputs (in a linear system) P H = 1 2P i =1 X i X i ' Non-zero mean inputs create large eigenvalues! Center the input vars, unless it makes you lose sparsity (see Vowpal Wabbit) Decorrelate the input vars Correlations leads to ill conditioning Decorrelation can be done with a Karhunen-Loève transform (like PCA) Rotate to the eigenspace of the covariance matrix Normalize the variances of the input variables or use per-weight learning rates / diagonal Hessian

21 Classical Classical(non (nonstochastic) stochastic)optimization OptimizationMethods Methods They are occasionally useful, but rarely Linear system with a finite training set stochastic training with mini-batch on certain machines Search for hyper-parameters Optimization over latent variables Conjugate gradient Iteration is O(N) Works with non-quadratic functions Requires a line search Can't work in stochastic mode. Requires mini-batches Levenberg-Marquardt Iteration is O(N^2) (except for diagonal approximation) Full version only practical for small N Works with non-quadratic functions

22 Conjugate Conjugate Gradient Gradient

23 Conjugate Conjugate Gradient Gradient

24 Computing Computing the the Hessian Hessian by by finite finite difference difference (rarely (rarely useful) useful)

25 Gauss-Newton Approximation of of the the Hessian Hessian

26 Computing Computing the the product product of of the the Hessian Hessian by by a Vector Vector

27 Using Using the the Power Power Method Method to to Compute Compute the the Largest Largest Eigenvalue Eigenvalue

28 Using Using the the Power Power Method Method to to Compute Compute the the Largest Largest Eigenvalue Eigenvalue

29 Stochastic Stochastic version version of of the the principal principal eigenvalue eigenvalue computation computation

30 Recipe Recipe to to compute compute the the maximum maximum learning learning rate rate for for SGD SGD

31 Recipe Recipe to to compute compute the the maximum maximum learning learning rate rate for for SGD SGD

32 Recipe Recipe to to compute compute the the maximum maximum learning learning rate rate for for SGD SGD

33 Recipe Recipe to to compute compute the the maximum maximum learning learning rate rate for for SGD SGD

34 Tricks Tricksto tomake MakeStochastic StochasticGradient GradientWork Work Best reads:, Léon Bottou, Genevieve B. Orr and Klaus-Robert Müller: Efficient Backprop, Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524, Springer Verlag, Léon Bottou: Stochastic Gradient Tricks, Neural Networks, Tricks of the Trade, Reloaded, , Edited by Grégoire Montavon, Genevieve B. Orr and Klaus-Robert Müller, Lecture Notes in Computer Science (LNCS 7700), Springer, Tricks Average SGD: Normalization of inputs All the tricks in Vowpal Wabbit

35 Learning LearningRate RateSchedule Schedule++Average AverageSGD SGD Learning rate schedule (after Wei Xu & Leon Bottou) ηt =η0 /(1+ λ η0 t)0.75 Average SGD: time averaging (Polyak & Juditsky 1992) Report the average of the last several weight vectors e.g. by using a running average

36 Computing Computing the the Diagonal Diagonal Hessian Hessian in in Complicated Complicated Structures Structures

37 Computing Computing the the Diagonal Diagonal Hessian Hessian in in Complicated Complicated Structures Structures

38 Stochastic StochasticDiagonal DiagonalLevenberg-Marquardt Levenberg-Marquardt

Optimization Methods Non-Linear/Non-Convex. Non-Linear/Non-Convex Learning Problems

Optimization Methods Non-Linear/Non-Convex. Non-Linear/Non-Convex Learning Problems Optimization Optimization Methods Methods for for Non-Linear/Non-Convex Non-Linear/Non-Convex Learning Learning Problems Problems Yann YannLeCun LeCun Courant CourantInstitute Institute http://yann.lecun.com