Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Transcription
1 Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Jacob Rafati, Ph.D. Candidate, Electrical Engineering and Computer Science, University of California, Merced
2 Agenda
Introduction, Problem Statement and Motivations
Overview of Quasi-Newton Optimization Methods
L-BFGS Trust-Region Optimization Method
Proposed Methods for Initialization of L-BFGS
Application in Deep Learning (Image Classification Task)
3 Introduction, Problem Statement and Motivations
4 Unconstrained Optimization Problem
$$\min_{w \in \mathbb{R}^n} L(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w), \qquad L : \mathbb{R}^n \to \mathbb{R}$$
5 Optimization Algorithms
Bottou et al. (2016). Optimization methods for large-scale machine learning. arXiv preprint.
6 Optimization Algorithms
1. Start from a random point $w_0$.
2. Repeat for each iteration $k = 0, 1, 2, \dots$:
3. Choose a search direction $p_k$.
4. Choose a step size $\alpha_k$.
5. Update the parameters: $w_{k+1} \leftarrow w_k + \alpha_k p_k$.
6. Until $\|\nabla L\| < \epsilon$.
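A minimal sketch of this generic loop in Python/NumPy, assuming a caller-supplied gradient function and using the steepest-descent direction $p_k = -\nabla L(w_k)$ as a placeholder (the later slides swap in quasi-Newton directions):

```python
import numpy as np

def minimize(grad, w0, step_size=0.1, tol=1e-6, max_iter=1000):
    """Generic iterative scheme: w_{k+1} = w_k + alpha_k * p_k (sketch)."""
    w = w0.copy()
    for k in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:   # stopping test ||grad L|| < epsilon
            break
        p = -g                        # placeholder direction; quasi-Newton replaces this
        w = w + step_size * p         # fixed step size alpha_k for simplicity
    return w
```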
7 Properties of Objective Function
$$\min_{w \in \mathbb{R}^n} L(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)$$
$n$ and $N$ are both large in modern applications. $L(w)$ is a non-convex and nonlinear function. $\nabla^2 L(w)$ is ill-conditioned. Computing the full gradient $\nabla L$ is expensive. Computing the Hessian $\nabla^2 L$ is not practical.
8 Stochastic Gradient Descent
1. Sample indices $S_k \subseteq \{1, 2, \dots, N\}$.
2. Compute the stochastic (subsampled) gradient $\nabla L(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$.
3. Assign a learning rate $\alpha_k$.
4. Update the parameters using $p_k = -\nabla L(w_k)^{(S_k)}$: $w_{k+1} \leftarrow w_k - \alpha_k \nabla L(w_k)^{(S_k)}$.
H. Robbins and D. Siegmund (1971). A convergence theorem for non-negative almost supermartingales and some applications. Optimizing Methods in Statistics.
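A sketch of one SGD step, assuming a per-example gradient callable `grad_i` (a hypothetical helper, not from the slides):

```python
import numpy as np

def sgd_step(w, grad_i, N, batch_size, lr, rng):
    """One SGD step with a subsampled gradient over a random index set S_k."""
    S = rng.choice(N, size=batch_size, replace=False)   # sample indices S_k
    g = np.mean([grad_i(i, w) for i in S], axis=0)      # (1/|S_k|) * sum of grad l_i(w)
    return w - lr * g                                   # w_{k+1} = w_k - alpha_k * g
```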
9 Advantages of SGD
SGD algorithms are very easy to implement. SGD requires only computing the gradient. SGD has a low cost per iteration.
Bottou et al. (2016). Optimization methods for large-scale machine learning. arXiv preprint.
10 Disadvantages of SGD
Very sensitive to ill-conditioning and scaling. Requires fine-tuning many hyper-parameters. Unlikely to exhibit acceptable performance on the first try; requires much trial and error. Can get stuck in a saddle point instead of a local minimum. Sublinear and slow rate of convergence.
Bottou et al. (2016). Optimization methods for large-scale machine learning. arXiv preprint. J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
11 Second-Order Methods
1. Sample indices $S_k \subseteq \{1, 2, \dots, N\}$.
2. Compute the stochastic (subsampled) gradient $\nabla L(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$.
3. Compute the subsampled Hessian $\nabla^2 L(w_k)^{(S_k)} \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla^2 \ell_i(w_k)$.
12 Second-Order Methods
4. Compute Newton's direction $p_k = -\nabla^2 L(w_k)^{-1} \nabla L(w_k)$.
5. Find a proper step length $\alpha_k = \arg\min_{\alpha} L(w_k + \alpha p_k)$.
6. Update the parameters: $w_{k+1} \leftarrow w_k + \alpha_k p_k$.
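A sketch of the Newton direction, assuming callables for the gradient and (subsampled) Hessian; the inverse is never formed explicitly, a linear solve is used instead:

```python
import numpy as np

def newton_direction(grad, hess, w):
    """Newton's direction p_k = -H^{-1} g via a linear solve (sketch)."""
    g = grad(w)
    H = hess(w)
    return np.linalg.solve(H, -g)   # solve H p = -g instead of inverting H
```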
13 Second-Order Methods: Advantages
The rate of convergence is superlinear (quadratic for Newton's method). They are resilient to problem ill-conditioning. They involve less parameter tuning. They are less sensitive to the choice of hyper-parameters.
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
14 Second-Order Methods: Disadvantages
Computing the Hessian matrix is very expensive and requires massive storage. Computing the inverse of the Hessian is not practical.
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer. Bottou et al. (2016). Optimization methods for large-scale machine learning. arXiv preprint.
15 Quasi-Newton Methods
1. Construct a low-rank approximation of the Hessian, $B_k \approx \nabla^2 L(w_k)$.
2. Find the search direction by minimizing the quadratic model of the objective function:
$$p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p$$
16 Quasi-Newton Matrices
Symmetric. Easy and fast to compute. Satisfy the secant condition:
$$B_{k+1} s_k = y_k, \qquad s_k \triangleq w_{k+1} - w_k, \qquad y_k \triangleq \nabla L(w_{k+1}) - \nabla L(w_k)$$
17 Broyden–Fletcher–Goldfarb–Shanno (BFGS)
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
18 Broyden–Fletcher–Goldfarb–Shanno (BFGS)
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}, \qquad s_k \triangleq w_{k+1} - w_k, \quad y_k \triangleq \nabla L(w_{k+1}) - \nabla L(w_k), \quad B_0 = \gamma_k I$$
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
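The update is rank-two and easy to express directly; a minimal NumPy sketch:

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B_k -> B_{k+1} (sketch)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```

In practice the update is applied only when the curvature condition $s_k^T y_k > 0$ holds, which keeps $B_{k+1}$ positive definite.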
19 Quasi-Newton Methods: Advantages
The rate of convergence is superlinear. They are resilient to problem ill-conditioning. The second derivative is not required. They use only gradient information to construct the quasi-Newton matrices.
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
20 Quasi-Newton Methods: Disadvantages
The cost of storing the gradient information can be expensive. The quasi-Newton matrix can be dense. The quasi-Newton matrix grows in size and rank in large-scale problems.
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
21 Limited-Memory BFGS
Limited-memory storage: $S_k = [\, s_{k-m} \;\cdots\; s_{k-1} \,]$, $Y_k = [\, y_{k-m} \;\cdots\; y_{k-1} \,]$.
L-BFGS compact representation:
$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I,$$
where
$$\Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}, \qquad S_k^T Y_k = L_k + D_k + U_k,$$
with $L_k$, $D_k$, and $U_k$ the strictly lower-triangular, diagonal, and strictly upper-triangular parts of $S_k^T Y_k$.
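A sketch of assembling the compact factors for $B_0 = \gamma I$; for $m$ stored pairs, $\Psi_k$ is $n \times 2m$ and $M_k$ is $2m \times 2m$, so $B_k$ never has to be formed as an $n \times n$ matrix:

```python
import numpy as np

def lbfgs_compact(S, Y, gamma):
    """Factors Psi, M of B = gamma*I + Psi @ M @ Psi.T (sketch; S, Y are n-by-m)."""
    SY = S.T @ Y                              # S^T Y = L + D + U
    L = np.tril(SY, k=-1)                     # strictly lower-triangular part
    D = np.diag(np.diag(SY))                  # diagonal part
    Psi = np.hstack([gamma * S, Y])           # [B_0 S  Y] with B_0 = gamma*I
    M = -np.linalg.inv(np.block([[gamma * (S.T @ S), L],
                                 [L.T, -D]]))
    return Psi, M
```

A matrix-vector product $B_k v$ then costs $O(nm)$: `gamma * v + Psi @ (M @ (Psi.T @ v))`.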
22 Limited-Memory Quasi-Newton Methods
Low-rank approximation. Small memory of recent gradients. Low cost of computing the search direction. A linear or superlinear convergence rate can be achieved.
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
23 Objectives
$$B_0 = \gamma_k I, \qquad B_k = B_0 + \Psi_k M_k \Psi_k^T$$
What is the best choice of $\gamma_k$ for the initialization?
24 Overview on Quasi-Newton Optimization Strategies
25 Line Search Method
Quadratic model: $p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p$.
If $B_k$ is positive definite: $p_k = -B_k^{-1} g_k$.
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
26 Line Search Method
Wolfe conditions:
$$L(w_k + \alpha_k p_k) \le L(w_k) + c_1 \alpha_k \nabla L(w_k)^T p_k$$
$$\nabla L(w_k + \alpha_k p_k)^T p_k \ge c_2 \nabla L(w_k)^T p_k$$
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
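A sketch of checking both conditions for a trial step size; the constants $c_1$ and $c_2$ are typical defaults with $0 < c_1 < c_2 < 1$, not values from the slides:

```python
import numpy as np

def satisfies_wolfe(L, grad, w, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for a trial step size alpha (sketch)."""
    g0 = grad(w) @ p                                              # directional derivative at w
    sufficient_decrease = L(w + alpha * p) <= L(w) + c1 * alpha * g0
    curvature = grad(w + alpha * p) @ p >= c2 * g0
    return sufficient_decrease and curvature
```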
27 Trust Region Method
$$p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \quad \text{s.t.} \quad \|p\|_2 \le \delta_k$$
Conn, Gould, and Toint (2000). Trust Region Methods. SIAM.
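A sketch of one trust-region iteration built around this subproblem; `solve_subproblem` stands in for a solver (such as the Sherman-Morrison-Woodbury approach on a later slide), and the acceptance threshold and radius-update constants are conventional choices, not from the slides:

```python
import numpy as np

def trust_region_step(L, grad, solve_subproblem, w, delta, eta=0.1):
    """One trust-region iteration: solve min Q(p) s.t. ||p|| <= delta (sketch)."""
    g = grad(w)
    p, pred = solve_subproblem(g, delta)          # assumed solver; pred = Q(0) - Q(p)
    actual = L(w) - L(w + p)                      # actual reduction in the objective
    rho = actual / pred                           # model-agreement ratio
    if rho > 0.75 and np.linalg.norm(p) >= 0.99 * delta:
        delta *= 2.0                              # model is good: expand the region
    elif rho < 0.25:
        delta *= 0.25                             # model is poor: shrink the region
    if rho > eta:
        w = w + p                                 # accept the step
    return w, delta
```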
28 Trust Region Method
[Figure: trust-region subproblem solutions: global minimum, Newton step, local minimum.]
J. J. Moré and D. C. Sorensen (1984). Newton's method. In Studies in Numerical Analysis, Studies in Mathematics, Vol. 24, Math. Assoc. of America.
29 L-BFGS Trust Region Optimization Method
30 L-BFGS in Trust Region
$$B_k = B_0 + \Psi_k M_k \Psi_k^T$$
Eigendecomposition:
$$B_k = P \begin{bmatrix} \Lambda + \gamma_k I & 0 \\ 0 & \gamma_k I \end{bmatrix} P^T$$
Sherman-Morrison-Woodbury formula:
$$p_k = -\frac{1}{\tau} \left[ I - \Psi_k \left( \tau M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k, \qquad \tau = \gamma_k + \sigma,$$
with $\sigma$ the Lagrange multiplier of the trust-region constraint.
L. Adhikari et al. (2017). Limited-memory trust-region methods for sparse relaxation. In Proc. SPIE. Brust et al. (2017). On solving L-SR1 trust-region subproblems. Computational Optimization and Applications, vol. 66.
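Taking the reconstructed formula above at face value, a sketch of the subproblem solve; `M` is the small $2m \times 2m$ matrix from the compact representation, and `tau` combines $\gamma_k$ with the trust-region multiplier:

```python
import numpy as np

def smw_solve(g, Psi, M, tau):
    """p = -(1/tau) [I - Psi (tau M^{-1} + Psi^T Psi)^{-1} Psi^T] g (sketch)."""
    inner = tau * np.linalg.inv(M) + Psi.T @ Psi   # small (2m x 2m) system
    return -(g - Psi @ np.linalg.solve(inner, Psi.T @ g)) / tau
```

Because all inverses act on $2m \times 2m$ matrices, the cost stays linear in $n$.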
31 L-BFGS in Trust Region vs. Line Search
26th European Signal Processing Conference, Rome, Italy, September 2018.
32 Proposed Methods for Initialization of L-BFGS
33 Initialization Method I
$$B_0 = \gamma_k I, \qquad B_k = B_0 + \Psi_k M_k \Psi_k^T$$
Spectral estimate of the Hessian:
$$\gamma_k = \frac{y_{k-1}^T y_{k-1}}{s_{k-1}^T y_{k-1}}, \qquad \gamma_k = \arg\min_{\gamma} \| B_0^{-1} y_{k-1} - s_{k-1} \|_2^2, \quad B_0 = \gamma I$$
J. Nocedal and S. J. Wright (2006). Numerical Optimization. 2nd ed. New York: Springer.
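This estimate is a one-liner in practice; the guard against non-positive curvature ($s^T y \le 0$) is an added assumption, not from the slides:

```python
import numpy as np

def spectral_gamma(s_prev, y_prev, fallback=1.0):
    """Method I: gamma_k = (y^T y) / (s^T y), the spectral estimate (sketch)."""
    sy = s_prev @ y_prev
    if sy <= 0:                        # curvature condition violated: keep a safe default
        return fallback
    return (y_prev @ y_prev) / sy
```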
34 Initialization Method II
Consider a quadratic function $L(w) = \frac{1}{2} w^T H w + g^T w$, so that $\nabla^2 L(w) = H$.
We have $H S_k = Y_k$; therefore $S_k^T H S_k = S_k^T Y_k$.
Erway et al. (2018). Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations. arXiv e-prints.
35 Initialization Method II
Since
$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad \Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}, \qquad B_0 = \gamma_k I,$$
and imposing the secant condition $B_k S_k = Y_k$, we have
$$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k.$$
36 Initialization Method II
$$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k$$
Generalized eigenvalue problem:
$$(L_k + D_k + L_k^T)\, z = \lambda\, S_k^T S_k\, z$$
Upper bound on the initial value to avoid false curvature information: $\gamma_k \in (0, \lambda_{\min})$.
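A sketch of computing this upper bound with SciPy's generalized symmetric eigensolver, assuming $S_k^T S_k$ is positive definite (i.e., the stored steps are linearly independent):

```python
import numpy as np
from scipy.linalg import eigh

def gamma_upper_bound(S, Y):
    """Method II: smallest lambda of (L + D + L^T) z = lambda (S^T S) z (sketch)."""
    SY = S.T @ Y
    L = np.tril(SY, k=-1)                        # strictly lower-triangular part of S^T Y
    A = L + np.diag(np.diag(SY)) + L.T           # L + D + L^T (symmetric)
    lam = eigh(A, S.T @ S, eigvals_only=True)    # generalized eigenvalues, ascending
    return lam[0]                                 # choose gamma_k in (0, lam_min)
```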
37 Initialization Method III
$$B_0 = \gamma_k I, \qquad B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k$$
Note that the compact-representation matrices themselves contain $\gamma_k$:
$$\Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}$$
38 Initialization Method III
Generalized eigenvalue problem:
$$\tilde{A}\, z = \lambda\, \tilde{B}\, z$$
Upper bound on the initial values to avoid false curvature information: $\gamma_k \in (0, \lambda_{\min})$, where $\tilde{A}$ and $\tilde{B}$ are formed from $L_k + D_k + L_k^T$, $S_k^T Y_k$, and $S_k^T S_k$.
39 Applications in Deep Learning
40 Supervised Learning
Features: $X = \{x_1, x_2, \dots, x_i, \dots, x_N\}$. Labels: $T = \{t_1, t_2, \dots, t_i, \dots, t_N\}$. Learn a mapping $\Phi : X \to T$.
41 Supervised Learning
Input: $x$ → Model (predictor) $\Phi(x; w)$ → Output: $y$.
42 Convolutional Neural Network
Input → conv1 → pool1 → conv2 → pool2 → hidden4 → output; $y = \Phi(x; w)$, with $n = 413{,}080$ parameters.
LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, vol. 86, no. 11.
43 Loss Function
Target: $t = [0, 0, 1, 0, \dots, 0]$. Output: $y = [0, 0, 0.97, 0.03, \dots, 0]$.
Cross-entropy: $\ell(t, y) = -t \cdot \log(y) - (1 - t) \cdot \log(1 - y)$.
Empirical risk: $L(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, \Phi(x_i; w))$.
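A sketch of the elementwise cross-entropy; the clipping is an added numerical safeguard, not part of the slide:

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """l(t, y) = -t*log(y) - (1 - t)*log(1 - y), summed over components (sketch)."""
    y = np.clip(y, eps, 1.0 - eps)   # keep log() away from 0
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```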
44 Multi-Batch L-BFGS
Consecutive batches of the shuffled data overlap: $O_k = S_k \cap S_{k+1}$.
[Figure: shuffled data split into overlapping batches $S_k$, $S_{k+1}$, $S_{k+2}$ with overlaps $O_{k-1}, O_k, O_{k+1}, O_{k+2}$.]
Berahas et al. (2016). A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems 29.
45 Computing Gradients
$$g_k = \nabla L(w_k)^{(S_k)} = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$$
$$y_k = \nabla L(w_{k+1})^{(O_k)} - \nabla L(w_k)^{(O_k)}, \qquad O_k = S_k \cap S_{k+1}$$
Berahas et al. (2016). A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems 29.
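Computing $y_k$ on the overlap evaluates both gradients on the same samples, so the curvature pair stays consistent even though the batch changes. A sketch, with `grad_batch(w, idx)` an assumed helper returning the mean gradient over the given indices:

```python
import numpy as np

def curvature_pair(grad_batch, w_k, w_next, S_k, S_next):
    """Multi-batch L-BFGS pair (s_k, y_k) using the overlap O_k = S_k ∩ S_{k+1} (sketch)."""
    O_k = sorted(set(S_k) & set(S_next))                     # overlap of consecutive batches
    s_k = w_next - w_k
    y_k = grad_batch(w_next, O_k) - grad_batch(w_k, O_k)     # gradient difference on O_k only
    return s_k, y_k
```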
46 Experiment
47 Trust-Region Algorithm
48 Results - Loss
[Figure: training loss curves for the full-batch and stochastic settings.]
49 Results - Accuracy
[Figure: accuracy curves for the full-batch and stochastic settings.]
50 Results - Training Time
[Figure: training time for the full-batch and stochastic settings.]
51 Acknowledgement
This research is supported by NSF CMMI, IIS, and ACI awards.
52 Paper, Code and Slides: