
Trust-Region Optimization Methods Using Limited-Memory Symmetric Rank-One Updates for Off-The-Shelf Machine Learning, by Jennifer B. Erway, Joshua Griffin, Riadh Omheni, and Roummel F. Marcia. Technical report 2017-2, Department of Mathematics and Statistics, Wake Forest University (2017).

TRUST-REGION OPTIMIZATION METHODS USING LIMITED-MEMORY SYMMETRIC RANK-ONE UPDATES FOR OFF-THE-SHELF MACHINE LEARNING

JENNIFER B. ERWAY, JOSHUA GRIFFIN, RIADH OMHENI, AND ROUMMEL F. MARCIA

Abstract. Machine learning (ML) problems are often posed as highly nonlinear and nonconvex unconstrained optimization problems. Methods for solving ML problems based on stochastic gradient descent generally require fine-tuning many hyper-parameters. In this paper we propose an alternative approach for solving ML problems based on a quasi-Newton trust-region framework that does not require extensive parameter tuning. Numerical experiments on the MNIST data set show that, with a fixed computational time budget, the proposed approach achieves better results than other off-the-shelf approaches such as limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton methods and Hessian-free methods.

Date: October 26, 2017.
Key words and phrases. Nonconvex optimization, quasi-Newton methods, symmetric rank-one update, machine learning.

1. Introduction

There has been a resurgence of interest in nonconvex and nonlinear machine learning problems of the form

(1)    \min_{w \in \mathbb{R}^m} f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w),

where f_i is a function of the ith observation in a training data set {(x_i, y_i)}. This relatively recent openness to using nonconvex models opens up exciting research opportunities to revisit promising yet still fringe algorithms not commonly used by the machine learning community. The goal of this paper is to demonstrate how recent research [1] on a well-known quasi-Newton update formula opens the door to exploiting negative-curvature information in ways that more popular updates cannot. This capability might further illuminate why Hessian-free methods, which are computationally expensive, perform so well in practice.

In the literature, (1) is often referred to as the empirical risk. Generally speaking, these problems have several features that make traditional optimization algorithms ineffective. First, both m and n are very large (e.g., typically m, n ≥ 10^6). Second, there is a special type of redundancy present in (1) due to similarity between data points; namely, if S is a random subset of the indices {1, 2, ..., n}, then provided |S| is large enough but |S| ≪ n (e.g., n = 10^9 and |S| = 10^5),

(2)    \frac{1}{n} \sum_{i=1}^{n} f_i(w) \approx \frac{1}{|S|} \sum_{i \in S} f_i(w).

Existing methods. Three prevalent algorithms in machine learning for minimizing (1) are stochastic gradient descent methods, limited-memory BFGS methods, and Hessian-free methods. The stochastic gradient descent (SGD) method [2] randomly selects an index j from {1, 2, ..., n} at each iteration and updates w as follows:

    w ← w - η_j ∇f_j(w),

where ∇f_j denotes the gradient of f_j and the parameter η_j is the learning rate. SGD exploits the data-set redundancy described in (2), and its iteration complexity is independent of n (in contrast to classical optimization algorithms, which depend explicitly on n [3, 4, 5, 6, 7]). However, to enhance the performance of SGD in practice, developers must fine-tune many hyper-parameters, which has led to many variations of SGD (e.g., see [8, 9, 10, 11, 12, 13]). In practice, finding an effective sequence {η_j} can require solving the same problem many times to find the best sequence. This dilemma has led to a resurgence of interest in auto-tune algorithms that can aid the SGD user in this search [14, 15, 16, 17, 18].
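For concreteness, the following is a minimal NumPy sketch of the SGD update above and of the subsampled gradient in (2). The helper grad_fi(w, j), which is assumed to return ∇f_j(w), and the other names are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sgd(grad_fi, w0, n, step_sizes, rng=None):
    """Plain SGD for (1): at each iteration sample one index j uniformly from
    {0, ..., n-1} and take w <- w - eta_j * grad f_j(w)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w0, dtype=float)
    for eta in step_sizes:              # one learning-rate hyper-parameter per iteration
        j = int(rng.integers(n))        # random training index
        w -= eta * grad_fi(w, j)
    return w

def subsampled_gradient(grad_fi, w, n, batch_size, rng=None):
    """Approximate the full gradient of (1) on a random subset S, as in (2)."""
    rng = np.random.default_rng() if rng is None else rng
    S = rng.choice(n, size=batch_size, replace=False)
    return sum(grad_fi(w, j) for j in S) / batch_size
```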

Because of the fine-tuning required for the hyper-parameters, these algorithms are not often used as off-the-shelf methods for machine learning.

An alternative to SGD is the class of quasi-Newton methods. Generally speaking, quasi-Newton methods generate a sequence of iterates using the rule w_{k+1} = w_k + α_k p_k, where p_k = -B_k^{-1} ∇f(w_k), B_k is a quasi-Newton matrix that is updated at each iteration using gradient information, and α_k is a step length. In machine learning, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update is exclusively used to generate the sequence of quasi-Newton matrices {B_k}; that is,

(3)    B_{k+1} = B_k - \frac{1}{s_k^T B_k s_k} B_k s_k s_k^T B_k + \frac{1}{y_k^T s_k} y_k y_k^T,

where s_k = w_{k+1} - w_k and y_k = ∇f(w_{k+1}) - ∇f(w_k). Note that B_0 is usually taken to be a positive scalar multiple of the identity. BFGS can be used to generate a sequence of positive-definite matrices {B_k}, guaranteeing a descent direction. In large-scale optimization, a limited-memory version of this method, called limited-memory BFGS (L-BFGS), is used to bound storage requirements and promote efficiency. In this case, only a few of the most recently computed iterate and gradient pairs are stored. Unlike SGD, L-BFGS exploits information from previous iterations and requires little hyper-parameter tuning. The main disadvantage of L-BFGS is that in a nonconvex setting it has the difficult task of approximating an indefinite matrix (the true Hessian) with a positive-definite matrix B_k, which can result in the generation of nearly singular matrices {B_k}. Numerically, this creates the need for heuristics such as periodically reinitializing B_k to a multiple of the identity, effectively generating a steepest-descent direction in the next iteration. This can be a significant disadvantage for machine learning problems, where model quality is highly correlated with the quality of the initial steps [19].

Hessian-free methods for solving (1) leverage the ability to perform Hessian-vector products without storing the second-derivative matrix itself [19, 20, 21]. These algorithms perform approximate updates of the form

(4)    w_{k+1} = w_k + η_k p_k,   where   ∇²f(w_k) p_k = -∇f(w_k)

and p_k is an approximate Newton direction obtained using a conjugate-gradient (CG)-like algorithm. Traditional CG algorithms solve linear systems such as the one in (4) using only matrix-vector products, making them an attractive choice for large-scale optimization. Because ∇²f(w) may be indefinite, modified variants are needed to adapt to local nonconvexity. Remarkably, Martens [20] was able to show that Hessian-free methods can achieve out-of-the-box results competitive with manually tuned SGD on deep learning problems. Moreover, Pearlmutter [22] showed that Hessian-vector products can be computed at a cost on the order of a gradient evaluation. However, since multiple matrix-vector products may be required to solve (4), the per-iteration cost of these algorithms is significantly greater than that of L-BFGS. Thus, despite their allure as a tune-free approach, Hessian-free methods are for the most part unused and unexplored in practice.
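The inner solve in (4) can be sketched with a standard conjugate-gradient loop that touches the Hessian only through matrix-vector products. The hvp callable and the simple negative-curvature exit below are illustrative assumptions; practical Hessian-free solvers (e.g., [20]) add damping, preconditioning, and more careful truncation rules.

```python
import numpy as np

def cg_hessian_free(hvp, grad, max_iters=50, tol=1e-6):
    """Approximately solve H p = -grad by conjugate gradients, where the Hessian H
    is accessed only via hvp(v) = H @ v (e.g., Pearlmutter's R-operator [22])."""
    x = np.zeros_like(grad)
    r = -grad                       # residual b - H x with b = -grad and x = 0
    d = r.copy()
    rs = float(r @ r)
    for _ in range(max_iters):
        Hd = hvp(d)                 # the only access to the Hessian
        dHd = float(d @ Hd)
        if dHd <= 0.0:              # negative (or zero) curvature: simplest rule is to stop
            break
        alpha = rs / dHd
        x = x + alpha * d
        r = r - alpha * Hd
        rs_new = float(r @ r)
        if rs_new ** 0.5 < tol:     # residual small enough
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x
```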
Proposed approach. In this paper we propose an alternative approach for solving (1) based on quasi-Newton methods. However, unlike existing approaches that use L-BFGS updates, we propose using an update that allows for indefinite Hessian approximations. Additionally, to better accommodate indefinite Hessian approximations, we use a trust-region framework rather than the line search framework used in L-BFGS methods [23].

2. Methodology

In this section, we describe quasi-Newton matrices and then demonstrate their use in a trust-region method to solve (1).

2.1. Limited-Memory Symmetric Rank-One Update. Given a continuously differentiable function f(w) as in (1), the symmetric rank-one (SR1) update is given by

(5)    B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k},

where s_j = w_{j+1} - w_j and y_j = ∇f(w_{j+1}) - ∇f(w_j) for j = 0, ..., k.
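As a concrete (non-limited-memory) illustration of (5), the following sketch applies one SR1 update to a dense matrix B. The skip tolerance is a standard safeguard for SR1 and is an assumption here, not a detail taken from the paper.

```python
import numpy as np

def sr1_update(B, s, y, skip_tol=1e-8):
    """One symmetric rank-one update (5) of a dense Hessian approximation B.
    The limited-memory method in the paper never forms B explicitly."""
    r = y - B @ s
    denom = float(r @ s)
    # Standard safeguard: skip the update when the denominator is too small.
    if abs(denom) < skip_tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom
```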

The SR1 update can be written recursively and compactly using the outer-product representation

    B_{k+1} = B_0 + Ψ_k M_k Ψ_k^T,

where B_0 = γI for some γ > 0, Ψ_k is an n × (k+1) matrix, and M_k is a (k+1) × (k+1) matrix. In particular, Byrd et al. [24] show that for SR1 matrices,

    Ψ_k = Y_k - B_0 S_k   and   M_k = (D_k + L_k + L_k^T - S_k^T B_0 S_k)^{-1},

with

    S_k = [ s_0  s_1  s_2  ···  s_k ] ∈ R^{n×(k+1)}   and   Y_k = [ y_0  y_1  y_2  ···  y_k ] ∈ R^{n×(k+1)},

where L_k is the strictly lower triangular part, U_k is the strictly upper triangular part, and D_k is the diagonal part of S_k^T Y_k = L_k + D_k + U_k. In our proposed approach, we use a limited-memory SR1 (L-SR1) update, in which only the r most recent pairs {s_j, y_j} are stored; the value of r is typically very small, so that r ≪ n. Unlike the BFGS update, the SR1 update is not necessarily positive definite. Moreover, under mild conditions, SR1 updates have been shown to exhibit stronger convergence to the true Hessian than BFGS updates [25].

2.2. Trust-Region Methods. Trust-region methods are one of two important classes of methods for unconstrained optimization (see, e.g., [23, 26]). Basic trust-region methods generate a sequence of iterates {w_k} given by w_{k+1} = w_k + p_k, where p_k is an approximate solution to the trust-region subproblem

(6)    p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) = g_k^T p + \frac{1}{2} p^T B_k p   subject to   \|p\| \le \delta_k,

where g_k = ∇f(w_k), B_k ≈ ∇²f(w_k), and δ_k is a given positive constant known as the trust-region radius. At the end of each trust-region iteration, δ_k is used to compute δ_{k+1}: depending on how well the quadratic model predicted the actual decrease in f from w_k to w_{k+1}, the trust-region radius is increased or decreased for the next iteration. Unlike in line search methods, the Hessian approximation B_k need not be positive definite in trust-region methods. In our proposed approach, we use the L-SR1 matrix, which may be indefinite.

Methods for solving the trust-region subproblem (6) are generally based on the optimality conditions for a global solution of the subproblem (see [27, 28]), given in the following theorem.

Theorem. Let δ be a given positive constant. A vector p* is a global solution of the trust-region subproblem (6) if and only if \|p*\|_2 ≤ δ and there exists a unique σ* ≥ 0 such that B_k + σ*I is positive semidefinite and

(7)    (B_k + σ*I) p* = -g_k   and   σ*(δ_k - \|p*\|_2) = 0.

Moreover, if B_k + σ*I is positive definite, then the global minimizer is unique.

The optimality conditions in (7) indicate that the solution p* either lies in the interior of the trust region (with σ* = 0) or lies on the boundary, and that in either case it is the exact minimizer of a nearby convex quadratic problem. In recent work by the authors [1], an efficient solver for the L-SR1 trust-region subproblem (6) was proposed. We describe this approach next.

2.3. Solving the Trust-Region Subproblem. Solving the trust-region subproblem (6) is generally the computational bottleneck of trust-region methods. To solve the subproblems efficiently, we exploit the structure of the L-SR1 matrix to obtain global solutions to high accuracy. To begin, we transform the optimality equations (7) using the spectral decomposition of B_k. As shown in [1], the spectral decomposition of B_k can be obtained efficiently from the compact formulation of B_k.
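The following minimal sketch shows how the compact quantities Ψ_k and M_k might be assembled from the stored pairs, assuming S and Y hold the pairs as columns; it omits the safeguards a practical implementation would add (e.g., discarding pairs that make M_k ill-conditioned).

```python
import numpy as np

def lsr1_compact(S, Y, gamma):
    """Compact representation B = gamma*I + Psi @ M @ Psi.T of an L-SR1 matrix
    built from S = [s_0 ... s_k] and Y = [y_0 ... y_k] with B_0 = gamma*I,
    following Byrd, Nocedal, and Schnabel [24]."""
    Psi = Y - gamma * S                     # Psi_k = Y_k - B_0 S_k
    SY = S.T @ Y                            # S_k^T Y_k = L_k + D_k + U_k
    D = np.diag(np.diag(SY))                # diagonal part D_k
    L = np.tril(SY, k=-1)                   # strictly lower triangular part L_k
    M = np.linalg.inv(D + L + L.T - gamma * (S.T @ S))
    return Psi, M
```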
If B_k = B_0 + ΨMΨ^T and Ψ = QR is the thin QR factorization of Ψ, then B_k = γI + QRMR^T Q^T, where B_0 = γI and γ > 0. Since RMR^T is a small (k+1) × (k+1) matrix, its spectral decomposition V Λ̂ V^T can be computed quickly. Then, letting P = [ QV  (QV)^⊥ ] ∈ R^{n×n}, so that P^T P = P P^T = I, the spectral decomposition of B_k is given by

    B_k = P Λ P^T,   where   Λ = \begin{bmatrix} Λ_1 & 0 \\ 0 & Λ_2 \end{bmatrix} = \begin{bmatrix} Λ̂ + γI & 0 \\ 0 & γI \end{bmatrix},

with Λ_1 ∈ R^{(k+1)×(k+1)} and Λ_2 = γ I_{n-(k+1)}.
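A minimal sketch of this computation, assuming Ψ and M have been formed as above; only the (k+1) nontrivial eigenpairs are computed explicitly, since every remaining eigenvalue equals γ, and the solver in [1] handles the remaining details (the hard case, rank deficiency, and so on).

```python
import numpy as np

def lsr1_spectral(Psi, M, gamma):
    """Spectral information for B = gamma*I + Psi @ M @ Psi.T via the thin QR
    factorization Psi = Q R, as in [1].  Returns (Lambda1, U), where the columns
    of U = Q V are eigenvectors of B with eigenvalues Lambda1 = eig(R M R^T) + gamma."""
    Q, R = np.linalg.qr(Psi, mode='reduced')   # thin QR; R is (k+1) x (k+1)
    RMR = R @ M @ R.T                          # small symmetric matrix
    lam_hat, V = np.linalg.eigh(RMR)           # spectral decomposition V Lam_hat V^T
    Lambda1 = lam_hat + gamma                  # nontrivial eigenvalues of B
    U = Q @ V                                  # corresponding eigenvectors of B
    return Lambda1, U
```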

Using the spectral decomposition of B_k, the optimality equations (7) become

(8)    (Λ + σ*I) v = -P^T g   and   σ*(δ_k - \|v\|_2) = 0,

for some scalar σ* ≥ 0 and v = P^T p*, where p* is the global solution to (6). The Lagrange multiplier σ* can be obtained by substituting

    \|v\|_2^2 = \|(Λ + σ*I)^{-1} P^T g\|_2^2 = \sum_i \frac{(P^T g)_i^2}{(λ_i + σ*)^2}

into the second equation in (8). Once σ* is obtained, v can be computed from (8), and the solution to the original trust-region subproblem (6) is recovered as p* = P v.

The proposed L-SR1 trust-region method (L-SR1-TR) is described in Algorithm 1. For details on the subproblem solver and all related computations, see [1, Algorithm 1].

    Define parameters: m, 0 < τ_1 < 0.5, 0 < ε;
    Initialize w_0 ∈ R^n and compute g_0 = ∇f(w_0);
    Let k = 0;
    while not converged do
        if \|g_k\|_2 ≤ ε then stop;
        Find p_k that solves (6) using [1, Algorithm 1];
        Compute ρ_k = (f(w_k + p_k) - f(w_k)) / Q_k(p_k);
        Compute g_{k+1} and update B_{k+1};
        if ρ_k ≥ τ_1 then
            w_{k+1} = w_k + p_k;
        else
            w_{k+1} = w_k;
        end if
        Compute the trust-region radius δ_{k+1};
        k ← k + 1;
    end while

Algorithm 1: Limited-Memory SR1 Trust-Region (L-SR1-TR) Method
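A skeleton of this outer iteration is sketched below. Here solve_subproblem is a hypothetical placeholder for the subproblem solver of [1] (returning a step p and its model value Q(p)), and the pair storage and radius-update rule are simplified textbook choices rather than the paper's exact ones.

```python
import numpy as np

def lsr1_trust_region(f, grad, w0, solve_subproblem, memory=20,
                      max_iters=100, eps=1e-5, tau1=1e-4, delta=1.0):
    """Skeleton of the outer loop of Algorithm 1 (L-SR1-TR)."""
    w = np.array(w0, dtype=float)
    g = grad(w)
    pairs = []                                    # most recent (s_j, y_j) pairs
    for _ in range(max_iters):
        if np.linalg.norm(g) <= eps:              # first-order stopping test
            break
        p, Qp = solve_subproblem(g, pairs, delta)
        rho = (f(w + p) - f(w)) / Qp              # actual vs. predicted decrease
        g_new = grad(w + p)
        pairs.append((p, g_new - g))              # s_k = p_k, y_k = g_{k+1} - g_k
        if len(pairs) > memory:                   # keep only the newest pairs
            pairs.pop(0)
        if rho >= tau1:                           # sufficient decrease: accept the step
            w, g = w + p, g_new
        # Simple radius update: expand on very good agreement, shrink on poor.
        if rho > 0.75:
            delta *= 2.0
        elif rho < tau1:
            delta *= 0.5
    return w
```

In the limited-memory setting B_k is never formed explicitly; the subproblem solver works directly with the stored pairs through the compact representation of Section 2.1.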
3. Numerical results

In this section, we present numerical results comparing the performance of several methods, including the proposed L-SR1-TR approach, on the MNIST data set [29]. Here, we use a neural network model with various hidden layers. We optimize the model parameters w so that the chosen model function p(x, w) predicts the observed target vector y as accurately as possible over the training data with respect to the chosen loss and activation functions. The model p(x, w) is formed by a predetermined sequence of affine transformations nested with a nonlinear activation function applied component-wise. Specifically, we set a_0 = x and define a_j = σ(W_j a_{j-1} + b_j) for j = 1, ..., l, where l denotes the number of layers and σ is the chosen activation function. In our experiments, σ was defined to be the logistic function for the hidden layers, while softmax paired with cross-entropy was used for the final output layer to form the resulting loss-function element f_i in (1) (see [30, Chap. 11] for details; a sketch of this construction is given below).

Two errors are used to train a network: the training error and the test error. The training error is used to define the optimization problem (1). Most approaches that use training data tend to find models that overfit the data, i.e., the models find relationships specific to the training data that do not hold in general. In other words, overfitting prevents machine learning algorithms from generalizing correctly. To help prevent overfitting, an independent data set, called the test set, is used to validate the accuracy of the model and gauge its usefulness in making future predictions. Training errors and test errors are computed using the total loss function f(w) in (1).

For the results in this section, we compared the training and validation errors of three methods: (i) a Hessian-free method utilizing the generalized Gauss-Newton approach described in [20], (ii) an L-BFGS method based on [31], and (iii) the proposed L-SR1 trust-region method (L-SR1-TR). (Because existing SGD methods are already finely tuned on this data set, they were not considered.) We tested the three methods on the full MNIST data set with all training and testing observations. We compared these methods on a variety of network configurations; all tests were performed in MATLAB. The limited-memory methods were run with r = 20. These experiments were designed to test the hypothesis that one of the primary reasons why Hessian-free methods outperform BFGS variants on machine learning optimization problems is that they better approximate and exploit negative curvature.
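A minimal sketch of one loss term f_i(w) for such a network (logistic hidden layers, softmax with cross-entropy at the output); the layer sizes and parameter layout are illustrative assumptions.

```python
import numpy as np

def net_loss(weights, biases, x, y_onehot):
    """Loss term f_i(w) for one observation: sigmoid hidden layers followed by a
    softmax output with cross-entropy.  weights/biases are lists of per-layer
    parameters, e.g., for a 784-500-350-10 configuration."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))   # a_j = sigma(W_j a_{j-1} + b_j)
    z = weights[-1] @ a + biases[-1]             # final affine layer
    z -= z.max()                                 # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()              # softmax probabilities
    return -float(y_onehot @ np.log(p))          # cross-entropy loss
```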

Figure 1. Plots of the loss function versus iterations ((a)-(c)) and versus time ((d)-(f)) on the MNIST data set using three different hidden-layer configurations ({500, 350}, {350, 250, 150}, and {400, 250, 150, 100, 30}), with an input layer of size 784 and an output layer of size 10. While the Hessian-free method has the lowest test and training loss with respect to iterations, the proposed L-SR1-TR approach outperforms both the L-BFGS and Hessian-free methods with respect to wall time.

The results for three different network configurations of varying depth are given in Fig. 1. For each configuration, the loss versus iterations and the loss versus time are plotted. Generally speaking, the Hessian-free method outperforms both L-BFGS and L-SR1-TR in terms of achieving the smallest test loss (and training loss) in the fewest iterations, with L-SR1-TR outperforming L-BFGS. However, the cost per iteration of the Hessian-free method is significantly higher, since it requires multiple Hessian-vector products per iteration whereas the quasi-Newton methods use a (much cheaper) single gradient evaluation. Thus, in terms of wall time, L-SR1-TR is the fastest method, obtaining the best solution in the least amount of time given a one-hour window to solve each network.

4. Conclusion

In this paper, we presented an alternative approach for solving machine learning problems that is based on the L-SR1 update, which allows for indefinite Hessian approximations. This approach is particularly suitable for nonconvex problems, where exploiting directions of negative curvature is crucial. Numerical experiments demonstrate that the proposed approach outperforms the more commonly used quasi-Newton approach (L-BFGS) both in computational efficiency and in test and training loss.

Acknowledgments. J. Erway's research was funded by NSF grants CMMI-1334042 and IIS-1741264. R. Marcia's research was funded by NSF grants CMMI-1333326 and IIS-1741490.

References

[1] J. Brust, J. B. Erway, and R. F. Marcia, On solving L-SR1 trust-region subproblems, Comput. Optim. Appl., vol. 66, no. 2, pp. 245-266, 2017.
[2] H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400-407, 1951.
[3] N. N. Schraudolph, J. Yu, and S. Günter, A stochastic quasi-Newton method for online convex optimization, in AISTATS, 2007, pp. 436-443.

[4] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, A stochastic quasi-Newton method for large-scale optimization, SIAM J. Optim., vol. 26, no. 2, pp. 1008-1031, 2016.
[5] F. Curtis, A self-correcting variable-metric algorithm for stochastic optimization, in ICML, 2016, pp. 632-641.
[6] P. Moritz, R. Nishihara, and M. Jordan, A linearly-convergent stochastic L-BFGS algorithm, in AISTATS, 2016, pp. 249-258.
[7] R. Gower, D. Goldfarb, and P. Richtárik, Stochastic block BFGS: squeezing more curvature out of data, in ICML, 2016, pp. 1869-1878.
[8] J. C. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
[9] M. D. Zeiler, ADADELTA: an adaptive learning rate method, CoRR, vol. abs/1212.5701, 2012.
[10] D. P. Kingma and J. Ba, Adam: a method for stochastic optimization, CoRR, vol. abs/1412.6980, 2014.
[11] S. Zhang, A. E. Choromanska, and Y. LeCun, Deep learning with elastic averaging SGD, in NIPS, 2015, pp. 685-693.
[12] S. Ioffe and C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in ICML, 2015, pp. 448-456.
[13] M. Abadi et al., TensorFlow: large-scale machine learning on heterogeneous distributed systems, CoRR, vol. abs/1603.04467, 2016.
[14] J. Ngiam et al., On optimization methods for deep learning, in ICML, 2011, pp. 265-272.
[15] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, Algorithms for hyper-parameter optimization, in NIPS, 2011, pp. 2546-2554.
[16] J. Bergstra, D. Yamins, and D. Cox, Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures, in ICML, 2013, pp. 115-123.
[17] J. Snoek, H. Larochelle, and R. P. Adams, Practical Bayesian optimization of machine learning algorithms, in NIPS, 2012, pp. 2951-2959.
[18] D. Maclaurin, D. Duvenaud, and R. Adams, Gradient-based hyperparameter optimization through reversible learning, in ICML, 2015, pp. 2113-2122.
[19] J. Martens and I. Sutskever, Training deep and recurrent networks with Hessian-free optimization, in Neural Networks: Tricks of the Trade, pp. 479-535, Springer, 2012.
[20] J. Martens, Deep learning via Hessian-free optimization, in ICML, 2010, pp. 735-742.
[21] Y. N. Dauphin et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in NIPS, 2014, pp. 2933-2941.
[22] B. A. Pearlmutter, Fast exact multiplication by the Hessian, Neural Computation, vol. 6, no. 1, pp. 147-160, 1994.
[23] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, New York, 2nd edition, 2006.
[24] R. H. Byrd, J. Nocedal, and R. B. Schnabel, Representations of quasi-Newton matrices and their use in limited-memory methods, Math. Program., vol. 63, pp. 129-156, 1994.
[25] A. R. Conn, N. I. M. Gould, and Ph. L. Toint, Convergence of quasi-Newton matrices generated by the symmetric rank one update, Math. Program., vol. 50, no. 2, pp. 177-195, 1991.
[26] A. R. Conn, N. I. M. Gould, and Ph. L. Toint, Trust-Region Methods, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2000.
[27] D. M. Gay, Computing optimal locally constrained steps, SIAM J. Sci. Statist. Comput., vol. 2, no. 2, pp. 186-197, 1981.
[28] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM J. Sci. Statist. Comput., vol. 4, pp. 553-572, 1983.
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[30] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, vol. 1, Springer Series in Statistics, New York, 2001.
[31] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math. Program. B, vol. 45, pp. 503-528, 1989.

E-mail address: erwayjb@wfu.edu
Department of Mathematics, Wake Forest University, Winston-Salem, NC 27109, USA

E-mail address: Joshua.Griffin@sas.com
SAS, Cary, NC 27513, USA

E-mail address: Riadh.Omheni@sas.com
SAS, Cary, NC 27513, USA

E-mail address: rmarcia@ucmerced.edu
Applied Mathematics, University of California, Merced, Merced, CA 95343, USA