Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University of California, Merced

Agenda Introduction, Problem Statement and Motivations Overview on Quasi-Newton Optimization Methods L-BFGS Trust Region Optimization Method Proposed Methods for Initialization of L-BFGS Application in Deep Learning (Image Classification Task)

Introduction, Problem Statement and Motivations

Unconstrained Optimization Problem NX min w2r n L(w), 1 N `i(w) i=1 L : R n! R

Optimization Algorithms Bottou et al., (2016) Optimization methods for large-scale machine learning, print arxiv:1606.04838

Optimization Algorithms 1. Start from a random point w 0 2. Repeat each iteration, k =0, 1, 2,..., 3. Choose a search direction p k 4. Choose a step size k 5. Update parameters w k+1 w k + k p k 6. Until krlk <

Properties of Objective Function NX min w2r n L(w), 1 N `i(w) i=1 n and N are both large in modern applications. L(w) is a non-convex and nonlinear function. r 2 L(w) is ill-conditioned. Computing full gradient, rl is expensive. Computing Hessian, r 2 L is not practical.

Stochastic Gradient Decent 1. Sample indices S k {1, 2,...,N} 2. Compute stochastic (subsampled) gradient rl(w k ) rl(w k ) (S k), 1 S k X i2s k r`i(w k ) 3. Assign a learning rate k 4. Update parameters using p k = rl(w k ) (S k) w k+1 w k k rl(w k ) (S k) H. Robbins, D. Siegmund. (1971). A convergence theorem for non negative almost supermartingales and some applications. Optimizing methods in statistics.

Advantages of SGD SGD algorithms are very easy to implement. SGD requires only computing the gradient. SGD has a low cost-per-iteration. Bottou et al., (2016) Optimization methods for large-scale machine learning, print arxiv:1606.04838

Disadvantages of SGD Very sensitive to the ill-conditioning problem and scaling. Requires fine-tuning many hyper-parameters. Unlikely exhibit acceptable performance on first try. requires many trials and errors. Can stuck in a saddle-point instead of local minimum. Sublinear and slow rate of convergence. Bottou et al., (2016). Optimization methods for large-scale machine learning. print arxiv:1606.04838 J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Second-Order Methods 1. Sample indices S k {1, 2,...,N} 2. Compute stochastic (subsampled) gradient rl(w k ) rl(w k ) (S k), 1 S k X i2s k r`i(w k ) 3. Compute Hessian r 2 L(w k ) r 2 L(w k ) (S k), 1 S k X i2s k r 2`i(w k )

Second-Order Methods 4. Compute Newton s direction p k = r 2 L(w k ) 1 rl(w k ) 5. Find proper step length k =min L(w k + p k ) 6. Update parameters w k+1 w k + k p k

Second-Order Methods Advantages The rate of convergence is super-linear (quadratic for Newton method). They are resilient to problem ill-conditioning. They involve less parameter tuning. They are less sensitive to the choice of hyperparameters. J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Second-Order Methods Disadvantages Computing the Hessian matrix is very expensive and requires massive storage. Computing the inverse of Hessian is not practical. J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer. Bottou et al., (2016) Optimization methods for large-scale machine learning, print arxiv:1606.04838

Quasi-Newton Methods 1. Construct a low-rank approximation of Hessian B k r 2 L(w k ) 2. Find the search direction by Minimizing the Quadratic Model of the objective function p k = argmin p2r n Q k (p), g T k p + 1 2 pt B k p T

Quasi-Newton Matrices Symmetric Easy and Fast Computation Satisfies Secant Condition B k+1 s k = y k s k, w k+1 w k y k, rl(w k+1 ) rl(w k )

Broyden Fletcher Goldfarb Shanno. J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Broyden Fletcher Goldfarb Shanno. 1 B k+1 = B k s T k B B k s k s T k B k + 1 ks k y T k s k y k y T k, s k, w k+1 w k y k, rl(w k+1 ) rl(w k ) B 0 = k I J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Quasi-Newton Methods Advantages The rate of convergence is super-linear. They are resilient to problem ill-conditioning. The second derivative is not required. They only use the gradient information to construct quasi-newton matrices. J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Quasi-Newton Methods disadvantages The cost of storing the gradient informations can be expensive. The quasi-newton matrix can be dense. The quasi-newton matrix grow in size and rank in large-scale problems. J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Limited-Memory BFGS Limited Memory Storage S k = s k m... s k 1 Y k = y k m... y k 1 L-BFGS Compact Representation B k = B 0 + km k T k B 0 = k I where apple S T k B 0 S k L k 1 k = B 0 S k Y k, Mk = L T k D k S T k Y k = L k + D k + U k

Limited-Memory Quasi-Newton Methods Low rank approximation Small memory of recent gradients. Low cost of computation of search direction. Linear or superlinear Convergence rate can be achieved. J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Objectives B 0 = k I B k = B 0 + km k T k What is the best choice for initialization?

Overview on Quasi-Newton Optimization Strategies

Line Search Method w k rl(w k ) p k Quadratic model p k = argmin p2r n Q k (p), g T k p + 1 2 pt B k p T if B k is positive definite: p k = B 1 k g k J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Line Search Method w k w k+1 k p k p k Wolfe conditions L(w k + k p k ) apple L(w k )+c 1 k rl(w k ) T p k rl(w k + k p k ) T p k c 2 rf(w k ) T p k J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Trust Region Method p k = arg min p2r n Q(p) s.t. kpk 2 apple k k w k p k Toint et al. (2000), Trust Region Methods, SIAM.

Trust Region Method 6 4 2 Global minimum 0-2 -4 Newton step Local minimum -6-6 -4-2 0 2 4 6 J. J. Mor e and D. C. Sorensen, (1984) Newton s method, in Studies in Mathematics, Volume 24. Studies in Numerical Analysis, Math. Assoc., pp. 29 82.

L-BFGS Trust Region Optimization Method

L-BFGS in Trust Region B k = B 0 + km k T k Eigen-decomposition B k = P apple + k I 0 0 ki P T Sherman-Morrison-Woodbury Formula p k = 1 k + I k( M 1 = k + T k k) 1 T k gk, L. Adhikari et al. (2017) Limited-memory trust-region methods for sparse relaxation, in Proc.SPIE, vol. 10394. Brust et al, (2017). On solving L-SR1 trust-region subproblems, Computational Optimization and Applications, vol. 66, pp. 245 266.

L-BFGS in Trust Region vs. Line-Search 26th European Signal Processing Conference, Rome, Italy. September 2018.

Proposed Methods for Initialization of L-BFGS

Initialization Method I B 0 = k I B k = B 0 + km k T k Spectral estimate of Hessian k = yt k 1 y k 1 s T k 1 y k 1 k = arg min kb 1 0 y k 1 s k 1 k 2 2, B 0 = I J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Initialization Method II Consider a quadratic function L(w) = 1 2 wt Hw + g T w r 2 L(w) =H We have HS k = Y k Therefore S T k HS k = S T k Y k Erway et al. (2018). Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations, ArXiv e-prints.

Initialization Method II Since B k = B 0 + km k k = B 0 S k Y k, Mk = Secant Condition T k B 0 = apple S T k B 0 S k L k L T D k k k I 1 B k S T k = Y k We have S T k HS k k S T k S k = S T k km k T S k k

Initialization Method II S T k HS k k S T k S k = S T k km k T S k k General Eigen-Value Problem (L k + D k + L T k )z = S T k S k z Upper bound for initial value to avoid false curvature information k 2 (0, min)

Initialization Method III B 0 = k I B k = B 0 + km k T k S T k HS k k S T k S k = S T k km k T S k k Note that compact representation matrices contains k apple S T k B 0 S k L k 1 k = B 0 S k Y k, Mk = L T k D k

Initialization Method III General Eigen-Value Problem A z = B z Upper bound for initial values to avoid false curvature information k 2 (0, min) A = L k +D k +L T k S T k Y k DY T k S k 2 k 1 (S T k S k ÃS T k S k ), B = S T k S k + S T k S k BY T k S T k + S T k Y k BT S T k S k.

Applications in Deep Learning

Supervised Learning Features X = {x 1,x 2,...,x i,...,x N } Labels T = {t 1,t 2,...,t i,...,t N } : X! T

Supervised Learning Input: x Model (Predictor) Output: y (x; w) 0 1 2 3... 9 0 0 0.97 0.03... 0

Convolutional Neural Network Input conv1 pool1 conv2 pool2 hidden4 output (x; w) n = 413, 080 Lecun et al. (1998) Gradient-basedlearning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278 2324.

Loss Function Target Output t = 0, 0, 1, 0,..., 0 y = 0, 0, 0.97, 0.03,..., 0 Cross-entropy `(t, y) = t. log(y) (1 t). log(1 y) Empirical Risk NX L(w) = 1 N `(t i, (x i,w)) i=1

Multi-Batch L-BFGS Shuffled Data Shuffled Data O k 1 O k S k O k = S k \ S k+1 O k O k+1 S k+1 O k+1 O k+2 S k+2 Berahas et al. (2016). A multi-batch L-BFGS method for machine learning, in Advances in Neural Information Processing Systems 29, pp. 1055 1063.

Computing gradients O k 1 O k S k O k = S k \ S k+1 O k O k+1 S k+1 g k = rl(w k ) (S k) = 1 S k X i2j k rl i (w k ) y k = rl(w k+1 ) (O k) rl(w k ) (O k) Berahas et al. (2016). A multi-batch L-BFGS method for machine learning, in Advances in Neural Information Processing Systems 29, pp. 1055 1063.

Experiment

Trust-Region Algorithm

Results - Loss Full Stochastic Full Stochastic

Results - Accuracy Full Stochastic Full Stochastic

Results - Training Time Full Stochastic Full Stochastic

Acknowledgement This research is supported by NSF Grants CMMI Award 1333326, IIS 1741490 and ACI 1429783.

Paper, Code and Slides: http://rafati.net jrafatiheravi@ucmerced.edu