Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning


Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning. Jacob Rafati, Ph.D. Candidate, Electrical Engineering and Computer Science, University of California, Merced. http://rafati.net, jrafatiheravi@ucmerced.edu

Agenda: Introduction, Problem Statement and Motivations; Overview of Quasi-Newton Optimization Methods; L-BFGS Trust-Region Optimization Method; Proposed Methods for Initialization of L-BFGS; Application in Deep Learning (Image Classification Task).

Introduction, Problem Statement and Motivations

Unconstrained Optimization Problem: $\min_{w \in \mathbb{R}^n} L(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)$, where $L : \mathbb{R}^n \to \mathbb{R}$.

Optimization Algorithms. Bottou et al. (2016), Optimization methods for large-scale machine learning, preprint arXiv:1606.04838.

Optimization Algorithms: 1. Start from a random point $w_0$. 2. Repeat for each iteration $k = 0, 1, 2, \ldots$: 3. Choose a search direction $p_k$. 4. Choose a step size $\alpha_k$. 5. Update parameters $w_{k+1} \leftarrow w_k + \alpha_k p_k$. 6. Until $\|\nabla L\| < \epsilon$.

Properties of the Objective Function: $\min_{w \in \mathbb{R}^n} L(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)$. Both $n$ and $N$ are large in modern applications. $L(w)$ is a non-convex and nonlinear function. $\nabla^2 L(w)$ is ill-conditioned. Computing the full gradient $\nabla L$ is expensive. Computing the Hessian $\nabla^2 L$ is not practical.

Stochastic Gradient Descent: 1. Sample indices $S_k \subset \{1, 2, \ldots, N\}$. 2. Compute the stochastic (subsampled) gradient $\nabla L(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$. 3. Assign a learning rate $\alpha_k$. 4. Update parameters using $p_k = -\nabla L(w_k)^{(S_k)}$: $w_{k+1} \leftarrow w_k - \alpha_k \nabla L(w_k)^{(S_k)}$. H. Robbins and D. Siegmund (1971). A convergence theorem for non-negative almost supermartingales and some applications. Optimizing Methods in Statistics.
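As a rough illustration, a minimal NumPy sketch of the SGD loop above; the function name `loss_grad` and its signature are hypothetical placeholders for a routine that returns the subsampled gradient.

```python
import numpy as np

def sgd(w0, loss_grad, N, lr=0.1, batch_size=32, max_iters=1000):
    """Minimal SGD loop (sketch); loss_grad(w, batch) is assumed to return
    the subsampled gradient (1/|S_k|) * sum_{i in S_k} grad l_i(w)."""
    w = np.asarray(w0, dtype=float).copy()
    rng = np.random.default_rng(0)
    for k in range(max_iters):
        batch = rng.choice(N, size=batch_size, replace=False)  # sample indices S_k
        g = loss_grad(w, batch)        # stochastic (subsampled) gradient
        w = w - lr * g                 # step along p_k = -g with learning rate lr
    return w
```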

Advantages of SGD: SGD algorithms are very easy to implement. SGD requires only computing the gradient. SGD has a low cost per iteration. Bottou et al. (2016), Optimization methods for large-scale machine learning, preprint arXiv:1606.04838.

Disadvantages of SGD: Very sensitive to ill-conditioning and scaling. Requires fine-tuning many hyper-parameters. Unlikely to exhibit acceptable performance on the first try; requires much trial and error. Can get stuck at a saddle point instead of a local minimum. Sublinear and slow rate of convergence. Bottou et al. (2016), Optimization methods for large-scale machine learning, preprint arXiv:1606.04838. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.

Second-Order Methods: 1. Sample indices $S_k \subset \{1, 2, \ldots, N\}$. 2. Compute the stochastic (subsampled) gradient $\nabla L(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$. 3. Compute the subsampled Hessian $\nabla^2 L(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla^2 \ell_i(w_k)$.

Second-Order Methods: 4. Compute Newton's direction $p_k = -\nabla^2 L(w_k)^{-1} \nabla L(w_k)$. 5. Find a proper step length $\alpha_k = \arg\min_{\alpha} L(w_k + \alpha p_k)$. 6. Update parameters $w_{k+1} \leftarrow w_k + \alpha_k p_k$.
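A minimal sketch of one such iteration, assuming hypothetical callables `grad`, `hess`, and `loss` for the (subsampled) gradient, Hessian, and objective; a crude backtracking loop stands in for the exact step-length minimization on the slide.

```python
import numpy as np

def newton_step(w, grad, hess, loss, shrink=0.5, max_backtracks=20):
    """One (subsampled) Newton iteration with backtracking on the step length (sketch)."""
    g = grad(w)
    H = hess(w)
    # Newton direction p_k = -H^{-1} g (solve a linear system instead of forming the inverse)
    p = -np.linalg.solve(H, g)
    alpha = 1.0
    for _ in range(max_backtracks):
        if loss(w + alpha * p) < loss(w):   # accept the first step that decreases the loss
            break
        alpha *= shrink
    return w + alpha * p
```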

Second-Order Methods, Advantages: The rate of convergence is superlinear (quadratic for Newton's method). They are resilient to problem ill-conditioning. They involve less parameter tuning. They are less sensitive to the choice of hyperparameters. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.

Second-Order Methods, Disadvantages: Computing the Hessian matrix is very expensive and requires massive storage. Computing the inverse of the Hessian is not practical. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York. Bottou et al. (2016), Optimization methods for large-scale machine learning, preprint arXiv:1606.04838.

Quasi-Newton Methods: 1. Construct a low-rank approximation of the Hessian, $B_k \approx \nabla^2 L(w_k)$. 2. Find the search direction by minimizing the quadratic model of the objective function: $p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p$.

Quasi-Newton Matrices: Symmetric. Easy and fast to compute. Satisfy the secant condition $B_{k+1} s_k = y_k$, where $s_k \triangleq w_{k+1} - w_k$ and $y_k \triangleq \nabla L(w_{k+1}) - \nabla L(w_k)$.

Broyden-Fletcher-Goldfarb-Shanno (BFGS). J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.

Broyden-Fletcher-Goldfarb-Shanno (BFGS) update: $B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$, where $s_k \triangleq w_{k+1} - w_k$, $y_k \triangleq \nabla L(w_{k+1}) - \nabla L(w_k)$, and $B_0 = \gamma_k I$. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.
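A small NumPy sketch of the BFGS update formula above; the curvature safeguard on $s_k^T y_k > 0$ is a common practical addition, not part of the slide.

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B (sketch).

    s = w_{k+1} - w_k,  y = grad L(w_{k+1}) - grad L(w_k).
    Skips the update when the curvature condition s^T y > 0 fails.
    """
    sy = s @ y
    if sy <= 1e-10:          # curvature condition violated; keep B unchanged
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy
```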

Quasi-Newton Methods, Advantages: The rate of convergence is superlinear. They are resilient to problem ill-conditioning. The second derivative is not required. They use only gradient information to construct the quasi-Newton matrices. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.

Quasi-Newton Methods, Disadvantages: The cost of storing the gradient information can be expensive. The quasi-Newton matrix can be dense. The quasi-Newton matrix grows in size and rank in large-scale problems. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.

Limited-Memory BFGS: Limited-memory storage $S_k = [\, s_{k-m} \;\cdots\; s_{k-1} \,]$, $Y_k = [\, y_{k-m} \;\cdots\; y_{k-1} \,]$. L-BFGS compact representation: $B_k = B_0 + \Psi_k M_k \Psi_k^T$ with $B_0 = \gamma_k I$, where $\Psi_k = [\, B_0 S_k \;\; Y_k \,]$, $M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}$, and $S_k^T Y_k = L_k + D_k + U_k$ (strictly lower-triangular, diagonal, and strictly upper-triangular parts).
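A sketch of assembling the compact representation from the stored curvature pairs, assuming `S` and `Y` are n-by-m NumPy arrays holding the recent pairs and `gamma` is the initialization $\gamma_k$; variable names are illustrative.

```python
import numpy as np

def lbfgs_compact(S, Y, gamma):
    """Build Psi_k and M_k so that B_k = gamma*I + Psi_k @ M_k @ Psi_k.T (sketch).

    S, Y: n x m arrays of recent pairs [s_{k-m}, ..., s_{k-1}] and [y_{k-m}, ..., y_{k-1}].
    """
    StY = S.T @ Y                              # S_k^T Y_k = L_k + D_k + U_k
    L = np.tril(StY, k=-1)                     # strictly lower-triangular part L_k
    D = np.diag(np.diag(StY))                  # diagonal part D_k
    Psi = np.hstack([gamma * S, Y])            # Psi_k = [B_0 S_k, Y_k]
    top = np.hstack([gamma * (S.T @ S), L])    # S_k^T B_0 S_k = gamma * S^T S
    bottom = np.hstack([L.T, -D])
    M = -np.linalg.inv(np.vstack([top, bottom]))
    return Psi, M
```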

Limited-Memory Quasi-Newton Methods: Low-rank approximation. Small memory of recent gradients. Low cost of computing the search direction. A linear or superlinear convergence rate can be achieved. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.

Objectives: Given $B_0 = \gamma_k I$ and $B_k = B_0 + \Psi_k M_k \Psi_k^T$, what is the best choice of $\gamma_k$ for initialization?

Overview of Quasi-Newton Optimization Strategies

Line-Search Method: At $w_k$, with gradient $\nabla L(w_k)$, the search direction minimizes the quadratic model: $p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p$. If $B_k$ is positive definite, then $p_k = -B_k^{-1} g_k$. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.

Line-Search Method: Take the step $w_{k+1} = w_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe conditions: $L(w_k + \alpha_k p_k) \le L(w_k) + c_1 \alpha_k \nabla L(w_k)^T p_k$ and $\nabla L(w_k + \alpha_k p_k)^T p_k \ge c_2 \nabla L(w_k)^T p_k$. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.
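For illustration, a short check of the Wolfe conditions above, assuming hypothetical `loss` and `grad` callables; the constants c1 and c2 are typical textbook defaults, not values taken from the slides.

```python
def satisfies_wolfe(loss, grad, w, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step length alpha along direction p (sketch)."""
    g0 = grad(w)
    sufficient_decrease = loss(w + alpha * p) <= loss(w) + c1 * alpha * (g0 @ p)
    curvature = grad(w + alpha * p) @ p >= c2 * (g0 @ p)
    return sufficient_decrease and curvature
```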

Trust-Region Method: $p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p)$ subject to $\|p\|_2 \le \delta_k$. Conn, Gould, and Toint (2000), Trust Region Methods, SIAM.

Trust-Region Method: [Figure: contour plot of a nonconvex objective comparing the Newton step with the trust-region step near a local minimum and the global minimum.] J. J. Moré and D. C. Sorensen (1984), Newton's method, in Studies in Mathematics, Volume 24: Studies in Numerical Analysis, Math. Assoc., pp. 29-82.

L-BFGS Trust Region Optimization Method

L-BFGS in Trust Region: $B_k = B_0 + \Psi_k M_k \Psi_k^T$. Eigen-decomposition: $B_k = P \begin{bmatrix} \hat{\Lambda} + \gamma_k I & 0 \\ 0 & \gamma_k I \end{bmatrix} P^T$. Sherman-Morrison-Woodbury formula: $p^* = -\frac{1}{\tau^* + \gamma_k} \left[ I - \Psi_k \left( (\tau^* + \gamma_k) M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k$, where $\tau^*$ is the optimal Lagrange multiplier of the subproblem. L. Adhikari et al. (2017), Limited-memory trust-region methods for sparse relaxation, in Proc. SPIE, vol. 10394. Brust et al. (2017), On solving L-SR1 trust-region subproblems, Computational Optimization and Applications, vol. 66, pp. 245-266.
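A sketch of applying the Sherman-Morrison-Woodbury identity above to form the trust-region step; it assumes the optimal shift `tau_star` has already been produced by the subproblem solver (it is not computed here), and that `Psi`, `M`, `gamma` come from the compact representation.

```python
import numpy as np

def trust_region_step(Psi, M, gamma, g, tau_star):
    """p* = -1/(tau*+gamma) * [I - Psi ((tau*+gamma) M^{-1} + Psi^T Psi)^{-1} Psi^T] g (sketch).

    The inverse of B_k + tau* I is applied via the Sherman-Morrison-Woodbury
    identity, so only a small (2m x 2m) system is solved and no n x n matrix is formed.
    """
    sigma = tau_star + gamma
    inner = sigma * np.linalg.inv(M) + Psi.T @ Psi
    correction = Psi @ np.linalg.solve(inner, Psi.T @ g)
    return -(g - correction) / sigma
```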

L-BFGS in Trust Region vs. Line Search. 26th European Signal Processing Conference, Rome, Italy, September 2018.

Proposed Methods for Initialization of L-BFGS

Initialization Method I: With $B_0 = \gamma_k I$ and $B_k = B_0 + \Psi_k M_k \Psi_k^T$, use the spectral estimate of the Hessian, $\gamma_k = \frac{y_{k-1}^T y_{k-1}}{s_{k-1}^T y_{k-1}}$, which solves $\gamma_k = \arg\min_{\gamma} \| B_0^{-1} y_{k-1} - s_{k-1} \|_2^2$ with $B_0 = \gamma I$. J. Nocedal and S. J. Wright (2006), Numerical Optimization, 2nd ed., Springer, New York.
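The spectral estimate of Method I is a one-liner; a sketch with illustrative argument names:

```python
def spectral_gamma(s_prev, y_prev):
    """Method I (sketch): gamma_k = (y^T y) / (s^T y), the spectral estimate of the Hessian."""
    return (y_prev @ y_prev) / (s_prev @ y_prev)
```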

Initialization Method II: Consider a quadratic function $L(w) = \frac{1}{2} w^T H w + g^T w$, so that $\nabla^2 L(w) = H$. We have $H S_k = Y_k$, and therefore $S_k^T H S_k = S_k^T Y_k$. Erway et al. (2018), Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations, arXiv e-prints.

Initialization Method II: Since $B_k = B_0 + \Psi_k M_k \Psi_k^T$ with $B_0 = \gamma_k I$, $\Psi_k = [\, B_0 S_k \;\; Y_k \,]$, $M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}$, and the secant condition gives $B_k S_k = Y_k$, we have $S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k$.

Initialization Method II: From $S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k$, we obtain the generalized eigenvalue problem $(L_k + D_k + L_k^T) z = \lambda\, S_k^T S_k z$. An upper bound on the initial value avoids false curvature information: $\gamma_k \in (0, \lambda_{\min})$.
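A sketch of computing the Method II upper bound with SciPy's generalized symmetric eigensolver, assuming `S` and `Y` are the stored curvature-pair matrices and that $S_k^T S_k$ is positive definite; the final choice of $\gamma_k$ inside $(0, \lambda_{\min})$ is illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def gamma_upper_bound(S, Y):
    """Method II (sketch): smallest generalized eigenvalue of
    (L_k + D_k + L_k^T) z = lambda * (S_k^T S_k) z, which upper-bounds gamma_k."""
    StY = S.T @ Y
    L = np.tril(StY, k=-1)                 # strictly lower-triangular part L_k
    D = np.diag(np.diag(StY))              # diagonal part D_k
    A = L + D + L.T                        # symmetric left-hand matrix
    B = S.T @ S                            # assumed positive definite
    lam = eigh(A, B, eigvals_only=True)    # generalized symmetric eigenvalues (ascending)
    return lam[0]

# gamma_k could then be taken as, e.g., 0.9 * gamma_upper_bound(S, Y)  (illustrative choice).
```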

Initialization Method III: With $B_0 = \gamma_k I$, $B_k = B_0 + \Psi_k M_k \Psi_k^T$, and $S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k$, note that the compact-representation matrices themselves contain $\gamma_k$: $\Psi_k = [\, B_0 S_k \;\; Y_k \,]$, $M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}$.

Initialization Method III: General eigenvalue problem $\tilde{A} z = \lambda \tilde{B} z$, with an upper bound on the initial value to avoid false curvature information: $\gamma_k \in (0, \lambda_{\min})$. Here $\tilde{A}$ and $\tilde{B}$ are assembled from $L_k$, $D_k$, $S_k^T Y_k$, $S_k^T S_k$, and the previous initialization $\gamma_{k-1}$.

Applications in Deep Learning

Supervised Learning: Features $X = \{x_1, x_2, \ldots, x_i, \ldots, x_N\}$, labels $T = \{t_1, t_2, \ldots, t_i, \ldots, t_N\}$, and a model $\phi : X \to T$.

Supervised Learning: Input $x$, model (predictor) $y = \phi(x; w)$, output $y$. For digit classes 0, 1, 2, 3, ..., 9, a sample output is (0, 0, 0.97, 0.03, ..., 0).

Convolutional Neural Network: input, conv1, pool1, conv2, pool2, hidden4, output; $\phi(x; w)$ with $n = 413{,}080$ parameters. LeCun et al. (1998), Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324.

Loss Function: Target $t = (0, 0, 1, 0, \ldots, 0)$, output $y = (0, 0, 0.97, 0.03, \ldots, 0)$. Cross-entropy: $\ell(t, y) = -t \cdot \log(y) - (1 - t) \cdot \log(1 - y)$. Empirical risk: $L(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, \phi(x_i; w))$.
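A small NumPy sketch of the cross-entropy loss and empirical risk above; the clipping constant is a numerical safeguard added for the example, and `T`, `Y` are assumed to be arrays of targets and model outputs.

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """l(t, y) = -t*log(y) - (1-t)*log(1-y), summed over output components (sketch)."""
    y = np.clip(y, eps, 1 - eps)      # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def empirical_risk(T, Y):
    """L(w) = (1/N) * sum_i l(t_i, phi(x_i; w)), given targets T and model outputs Y."""
    return np.mean([cross_entropy(t, y) for t, y in zip(T, Y)])
```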

Multi-Batch L-BFGS: The shuffled data are split into batches $S_k$, with consecutive batches sharing an overlap $O_k = S_k \cap S_{k+1}$ (so $S_k$ contains $O_{k-1}$ and $O_k$, $S_{k+1}$ contains $O_k$ and $O_{k+1}$, and so on). Berahas et al. (2016), A multi-batch L-BFGS method for machine learning, in Advances in Neural Information Processing Systems 29, pp. 1055-1063.

Computing gradients: With overlap $O_k = S_k \cap S_{k+1}$, the search-direction gradient is $g_k = \nabla L(w_k)^{(S_k)} = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$, while the curvature pair uses only the overlap: $y_k = \nabla L(w_{k+1})^{(O_k)} - \nabla L(w_k)^{(O_k)}$. Berahas et al. (2016), A multi-batch L-BFGS method for machine learning, in Advances in Neural Information Processing Systems 29, pp. 1055-1063.
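A sketch of the multi-batch gradient computation above, assuming a hypothetical `grad(w, idx)` routine that averages gradients over the samples in `idx`.

```python
import numpy as np

def multibatch_curvature(grad, w_k, w_k1, S_k, S_k1):
    """Multi-batch L-BFGS differencing over the overlap O_k = S_k ∩ S_{k+1} (sketch)."""
    O_k = np.intersect1d(S_k, S_k1)           # overlapping samples of consecutive batches
    g_k = grad(w_k, S_k)                      # search-direction gradient uses the full batch S_k
    y_k = grad(w_k1, O_k) - grad(w_k, O_k)    # curvature pair uses only the overlap
    s_k = w_k1 - w_k
    return g_k, s_k, y_k
```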

Experiment

Trust-Region Algorithm

Results - Loss: [Figure: training loss curves for the full-batch and stochastic settings.]

Results - Accuracy: [Figure: test accuracy curves for the full-batch and stochastic settings.]

Results - Training Time: [Figure: training time for the full-batch and stochastic settings.]

Acknowledgement: This research is supported by NSF grants CMMI 1333326, IIS 1741490, and ACI 1429783.

Paper, Code and Slides: http://rafati.net jrafatiheravi@ucmerced.edu