Stochastic Quasi-Newton Methods
1 Stochastic Quasi-Newton Methods. Donald Goldfarb, Department of IEOR, Columbia University. UCLA Distinguished Lecture Series, May 17-19.
2 Outline
- Stochastic approximation: stochastic gradient descent and variance reduction techniques.
- Newton-like and quasi-Newton methods for convex stochastic optimization problems using limited memory block BFGS updates.
- Numerical results on problems from machine learning.
- Quasi-Newton methods for nonconvex stochastic optimization problems using damped and modified limited memory BFGS updates.
3 Stochastic optimization
min f(x) = E[f(x, ξ)], where ξ is a random variable, or the finite sum (with f_i(x) := f(x, ξ_i) for i = 1, ..., n and very large n)
min f(x) = (1/n) Σ_{i=1}^n f_i(x)
f and ∇f are very expensive to evaluate; stochastic gradient descent (SGD) methods choose a random subset S ⊆ [n] and evaluate
f_S(x) = (1/|S|) Σ_{i∈S} f_i(x) and ∇f_S(x) = (1/|S|) Σ_{i∈S} ∇f_i(x).
Essentially, only noisy information about f, ∇f and ∇²f is available.
Challenge: how to smooth the variability of stochastic methods?
Challenge: how to design methods that take advantage of noisy 2nd-order information?
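A minimal sketch of the subsampled gradient ∇f_S described above, assuming toy quadratic components f_i and illustrative data (A, b and the batch size are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # gradient of the toy component f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def subsampled_grad(x, batch_size=32):
    # grad f_S(x) = (1/|S|) * sum_{i in S} grad f_i(x), with S drawn uniformly from [n]
    S = rng.choice(n, size=batch_size, replace=False)
    return np.mean([grad_fi(x, i) for i in S], axis=0)

x = np.zeros(d)
print(subsampled_grad(x).shape)  # (10,)
```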
4 Stochastic optimization
[Figure: iterate paths of a deterministic gradient method vs. a stochastic gradient method]
5 Stochastic Variance Reduced Gradients
Stochastic methods converge slowly near the optimum due to the variance of the gradient estimates ∇f_S(x), hence requiring a decreasing step size.
We use the control variates approach of Johnson and Zhang (2013) for an SGD method: SVRG. It uses
d = ∇f_S(x_t) − ∇f_S(w_k) + ∇f(w_k),
where w_k is a reference point, in place of ∇f_S(x_t).
w_k, and the full gradient, are computed after each full pass of the data, hence doubling the work of computing stochastic gradients.
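A minimal sketch of the SVRG outer/inner loop using the variance-reduced direction d above; the helpers grad_fi and full_grad and the parameter values are assumptions for illustration:

```python
import numpy as np

def svrg(grad_fi, full_grad, x0, n, eta=0.1, m=100, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        mu = full_grad(w)                 # full gradient at the reference point w_k
        x = w.copy()
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced direction d = grad f_i(x_t) - grad f_i(w_k) + grad f(w_k)
            d = grad_fi(x, i) - grad_fi(w, i) + mu
            x = x - eta * d
        w = x                             # new reference point after the inner loop
    return w
```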
6 Stochastic Average Gradient
At iteration t:
- Sample i from {1, ..., N}
- Update y_i^{t+1} = ∇f_i(x^t) and y_j^{t+1} = y_j^t for all j ≠ i
- Compute g^{t+1} = (1/N) Σ_{j=1}^N y_j^{t+1}
- Set x^{t+1} = x^t − α_{t+1} g^{t+1}
Provable linear convergence in expectation.
Other SGD variance reduction techniques have recently been proposed, including the methods SAGA, SDCA, and S2GD.
As a concrete illustration, a sketch of the update appears below.
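A minimal sketch of the SAG iteration above; grad_fi, the step size, and the iteration count are illustrative assumptions:

```python
import numpy as np

def sag(grad_fi, x0, n, alpha=0.01, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    y = np.zeros((n, x.size))     # y_j: last gradient seen for each component f_j
    g = np.zeros_like(x)          # running average g = (1/n) sum_j y_j
    for _ in range(iters):
        i = rng.integers(n)
        gi = grad_fi(x, i)
        g += (gi - y[i]) / n      # refresh the average with the new y_i
        y[i] = gi
        x = x - alpha * g
    return x
```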
7 Quasi-Newton Method for min f(x), f ∈ C¹
Gradient method: x_{k+1} = x_k − α_k ∇f(x_k)
Newton's method: x_{k+1} = x_k − α_k [∇²f(x_k)]^{-1} ∇f(x_k)
Quasi-Newton method: x_{k+1} = x_k − α_k B_k^{-1} ∇f(x_k), where B_k ≻ 0 approximates the Hessian matrix
Update: B_{k+1} s_k = y_k (secant equation), where s_k = x_{k+1} − x_k = α_k d_k and y_k = ∇f_{k+1} − ∇f_k
8 BFGS
BFGS quasi-Newton method:
B_{k+1} = B_k + y_k y_k^T / (s_k^T y_k) − B_k s_k s_k^T B_k / (s_k^T B_k s_k),
where s_k := x_{k+1} − x_k and y_k := ∇f(x_{k+1}) − ∇f(x_k).
B_{k+1} ≻ 0 if B_k ≻ 0 and s_k^T y_k > 0 (curvature condition).
The secant equation has a solution if s_k^T y_k > 0.
When f is strongly convex, s_k^T y_k > 0 holds automatically.
If f is nonconvex, use a line search to guarantee s_k^T y_k > 0.
H_{k+1} = (I − s_k y_k^T / (s_k^T y_k)) H_k (I − y_k s_k^T / (s_k^T y_k)) + s_k s_k^T / (s_k^T y_k)
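A minimal sketch of the inverse-Hessian form of the BFGS update H_{k+1} shown above (dense NumPy, for illustration only):

```python
import numpy as np

def bfgs_update_inverse(H, s, y):
    # H_{k+1} = (I - s y^T / s^T y) H (I - y s^T / s^T y) + s s^T / s^T y
    sy = s @ y
    assert sy > 0, "curvature condition s^T y > 0 required"
    I = np.eye(len(s))
    V = I - np.outer(s, y) / sy
    return V @ H @ V.T + np.outer(s, s) / sy
```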
9 Prior work on Quasi-Newton Methods for Stochastic Optimization
P1 N.N. Schraudolph, J. Yu and S. Günter. A stochastic quasi-Newton method for online convex optim. Int'l. Conf. AI & Stat., 2007.
Modifies BFGS and L-BFGS updates by reducing the step s_k and the last term in the update of H_k; uses step size α_k = β/k for small β > 0.
P2 A. Bordes, L. Bottou and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR vol. 10, 2009.
Uses a diagonal matrix approximation to [∇²f(·)]^{-1} which is updated (hence the name SGD-QN) on each iteration; α_k = 1/(k + α).
10 Prior work on Quasi-Newton Methods for Stochastic Optimization
P3 A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Trans. Signal Process., no. 10.
Replaces y_k by y_k − δ s_k for some δ > 0 in the BFGS update and also adds δI to the update; uses α_k = β/k; converges in expectation at the sub-linear rate E(f(x_k) − f*) ≤ C/k.
P4 A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. To appear in J. Mach. Learn. Res.
Uses L-BFGS without regularization and α_k = β/k; converges in expectation at the sub-linear rate E(f(x_k) − f*) ≤ C/k.
11 Prior work on Quasi-Newton Methods for Stochastic Optimization
P5 R.H. Byrd, S.L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optim. arXiv v2, 2015.
Averages iterates over L steps keeping H_k fixed; uses the averaged iterates to update H_k, using a subsampled Hessian to compute y_k; α_k = β/k; converges in expectation at a sub-linear rate E(f(x_k) − f*) ≤ C/k.
P6 P. Moritz, R. Nishihara, M.I. Jordan. A linearly-convergent stochastic L-BFGS algorithm, 2015, arXiv v1.
Combines [P5] with SVRG; uses a fixed step size α; converges in expectation at a linear rate.
12 Using Stochastic 2nd-order Information
Assumption: f(x) = (1/n) Σ_{i=1}^n f_i(x) is strongly convex and twice continuously differentiable.
Choose (compute) a sketching matrix S_k (the columns of S_k are a set of directions).
We do not use differences in noisy gradients to estimate curvature, but rather compute the action of the sub-sampled Hessian on S_k, i.e., compute
Y_k = (1/|T|) Σ_{i∈T} ∇²f_i(x) S_k, where T ⊆ [n].
13 Example of Hessian-Vector Computation
In a binary classification problem, the sample function (logistic loss) is
f(w; x_i, z_i) = −z_i log(c(w; x_i)) − (1 − z_i) log(1 − c(w; x_i)),
where c(w; x_i) = 1/(1 + exp(−x_i^T w)), x_i ∈ R^n, w ∈ R^n, z_i ∈ {0, 1}.
Gradient: ∇f(w; x_i, z_i) = (c(w; x_i) − z_i) x_i
Action of Hessian on s: ∇²f(w; x_i, z_i) s = c(w; x_i)(1 − c(w; x_i))(x_i^T s) x_i
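A short sketch of the gradient and Hessian-vector product formulas above, assuming the sigmoid form of c(w; x_i):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_grad(w, x_i, z_i):
    c = sigmoid(x_i @ w)
    return (c - z_i) * x_i                   # grad f = (c(w; x_i) - z_i) x_i

def logistic_hess_vec(w, x_i, s):
    c = sigmoid(x_i @ w)
    return c * (1.0 - c) * (x_i @ s) * x_i   # Hess f @ s = c (1 - c) (x_i^T s) x_i
```

Note that the Hessian action on a direction s costs only one inner product and one scaling of x_i, so the full d x d Hessian never needs to be formed.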
14 Block BFGS
The block BFGS method computes a least-change update to the current approximation H_k to the inverse Hessian matrix (∇²f(x))^{-1} at the current point x, by solving
min_H ‖H − H_k‖ s.t. H = H^T, H Y_k = S_k,
where ‖A‖ = ‖(∇²f(x_k))^{1/2} A (∇²f(x_k))^{1/2}‖_F (F = Frobenius norm).
This gives the updating formula (analogous to the updates derived by Broyden, Fletcher, Goldfarb and Shanno, 1970):
H_{k+1} = (I − S_k [S_k^T Y_k]^{-1} Y_k^T) H_k (I − Y_k [S_k^T Y_k]^{-1} S_k^T) + S_k [S_k^T Y_k]^{-1} S_k^T
or, by the Sherman-Morrison-Woodbury formula:
B_{k+1} = B_k − B_k S_k [S_k^T B_k S_k]^{-1} S_k^T B_k + Y_k [S_k^T Y_k]^{-1} Y_k^T
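A minimal dense sketch of the block BFGS inverse-Hessian update written above (illustrative only; the limited-memory variant on the next slides never forms H explicitly):

```python
import numpy as np

def block_bfgs_update(H, S, Y):
    # H_{k+1} = (I - S L Y^T) H (I - Y L S^T) + S L S^T, with L = (S^T Y)^{-1};
    # S^T Y = S^T (Hess) S is symmetric when Y is the subsampled Hessian action on S.
    Lam = np.linalg.inv(S.T @ Y)
    d = H.shape[0]
    V = np.eye(d) - S @ Lam @ Y.T
    W = np.eye(d) - Y @ Lam @ S.T
    return V @ H @ W + S @ Lam @ S.T
```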
15 Limited Memory Block BFGS
After M block BFGS steps starting from H_{k+1−M}, one can express H_{k+1} as
H_{k+1} = V_k H_k V_k^T + S_k Λ_k S_k^T
        = V_k V_{k−1} H_{k−1} V_{k−1}^T V_k^T + V_k S_{k−1} Λ_{k−1} S_{k−1}^T V_k^T + S_k Λ_k S_k^T
        = V_{k:k+1−M} H_{k+1−M} V_{k:k+1−M}^T + Σ_{i=k+1−M}^{k} V_{k:i+1} S_i Λ_i S_i^T V_{k:i+1}^T,
where V_k = (I − S_k Λ_k Y_k^T), Λ_k = (S_k^T Y_k)^{-1}, and V_{k:i} = V_k V_{k−1} · · · V_i.
16 Limited Memory Block BFGS Intuition
Hence, when the number of variables d is large, instead of storing the d × d matrix H_k, we store the previous M block curvature triples
(S_{k+1−M}, Y_{k+1−M}, Λ_{k+1−M}), ..., (S_k, Y_k, Λ_k).
Then, analogously to the standard L-BFGS method, for any vector v ∈ R^d, H_k v can be computed efficiently using a two-loop block recursion (in O(Mp(d + p) + p³) operations), if all S_i ∈ R^{d×p}.
The limited memory, least-change aspect of BFGS is important.
Each block update acts like a sketching procedure.
17 Two-Loop Recursion
[Algorithm: block two-loop recursion for computing H_k v from the stored triples (S_i, Y_i, Λ_i); the statement is not reproduced in the transcription]
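The recursion itself did not survive the transcription; the following is a reconstruction, based on the compact representation on the previous slide, of a block analogue of the standard L-BFGS two-loop recursion. The initial matrix H_{k+1−M} is approximated here by a scaled identity, which is an assumption for illustration:

```python
import numpy as np

def block_two_loop(v, blocks, H0_scale=1.0):
    # blocks = [(S_1, Y_1, Lam_1), ..., (S_M, Y_M, Lam_M)], oldest first,
    # with Lam_i = (S_i^T Y_i)^{-1}; returns H v for the limited-memory block BFGS H.
    q = np.asarray(v, dtype=float).copy()
    alphas = []
    for S, Y, Lam in reversed(blocks):        # backward pass: newest block first
        a = Lam @ (S.T @ q)
        alphas.append(a)
        q = q - Y @ a
    r = H0_scale * q                          # apply the initial matrix H_{k+1-M}
    for (S, Y, Lam), a in zip(blocks, reversed(alphas)):   # forward pass: oldest first
        b = Lam @ (Y.T @ r)
        r = r + S @ (a - b)
    return r
```

With S_i ∈ R^{d×p}, each pass costs O(Mp(d + p)) operations, matching the count quoted on the previous slide (plus p³ work to form each Λ_i).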
18 Choices for the Sketching Matrix S_k
We employ one of the following strategies:
- Gaussian: S_k ~ N(0, I) has Gaussian entries sampled i.i.d. at each iteration.
- Previous search directions s_i, delayed: store the previous L search directions S_k = [s_{k+1−L}, ..., s_k], then update H_k only once every L iterations.
- Self-conditioning: sample the columns of the Cholesky factor L_k of H_k (i.e., L_k L_k^T = H_k) uniformly at random. Fortunately we can maintain and update L_k efficiently with limited memory.
The matrix S is a sketching matrix in the sense that we are sketching the (possibly very large) equation ∇²f(x) H = I, whose solution is the inverse Hessian. Right multiplying by S compresses/sketches the equation, yielding ∇²f(x) H S = S.
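A small sketch of the three choices above; the function names and the uniform column sampling details are illustrative assumptions:

```python
import numpy as np

def sketch_gaussian(d, p, rng):
    # S_k with i.i.d. Gaussian entries, redrawn every iteration
    return rng.standard_normal((d, p))

def sketch_prev_directions(recent_steps):
    # S_k = [s_{k+1-L}, ..., s_k], built from the stored previous search directions
    return np.column_stack(recent_steps)

def sketch_self_conditioning(L_chol, p, rng):
    # sample p columns of the Cholesky factor L_k of H_k (L_k L_k^T = H_k)
    cols = rng.choice(L_chol.shape[1], size=p, replace=False)
    return L_chol[:, cols]
```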
19 The Basic Algorithm
[Algorithm statement: not reproduced in the transcription]
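Since the algorithm box did not survive the transcription, the following is only a rough reconstruction of how the preceding pieces could fit together: an SVRG-style outer loop whose inner steps are preconditioned by the limited-memory block BFGS matrix, with curvature pairs formed from the subsampled Hessian action on a Gaussian sketch. It reuses the block_two_loop helper from the sketch above; all parameter values and batch sizes are illustrative assumptions, not the authors' choices:

```python
import numpy as np

def stochastic_block_lbfgs(grad_fi, hess_vec_fi, full_grad, x0, n, eta=0.1,
                           m=100, batch=32, p=5, M=5, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(x0, dtype=float).copy()
    blocks = []                                   # stored (S_i, Y_i, Lambda_i) triples
    for _ in range(epochs):
        mu = full_grad(w)                         # full gradient at the reference point
        x = w.copy()
        for _ in range(m):
            S_idx = rng.choice(n, size=batch, replace=False)
            # SVRG-style variance-reduced gradient estimate
            g = np.mean([grad_fi(x, i) - grad_fi(w, i) for i in S_idx], axis=0) + mu
            d = block_two_loop(g, blocks) if blocks else g    # precondition by H_k
            x = x - eta * d
        # refresh curvature: Y = (1/|T|) sum_{i in T} Hess f_i(x) S, S a Gaussian sketch
        T_idx = rng.choice(n, size=batch, replace=False)
        S = rng.standard_normal((x.size, p))
        Y = np.mean([np.column_stack([hess_vec_fi(x, i, S[:, j]) for j in range(p)])
                     for i in T_idx], axis=0)
        blocks = (blocks + [(S, Y, np.linalg.inv(S.T @ Y))])[-M:]
        w = x
    return w
```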
20 Convergence - Assumptions
There exist constants λ, Λ ∈ R_+ such that
f is λ-strongly convex: f(w) ≥ f(x) + ∇f(x)^T (w − x) + (λ/2) ‖w − x‖²₂,
f is Λ-smooth: f(w) ≤ f(x) + ∇f(x)^T (w − x) + (Λ/2) ‖w − x‖²₂.
These assumptions imply that
λI ⪯ ∇²f_S(w) ⪯ ΛI for all w ∈ R^d, S ⊆ [n],
from which we can prove that there exist constants γ, Γ ∈ R_+ such that for all k we have
γI ⪯ H_k ⪯ ΓI.
21 Bounds on the Spectrum of H_k
Lemma. Assume there exist 0 < λ < Λ such that λI ⪯ ∇²f_T(x) ⪯ ΛI for all x ∈ R^d and T ⊆ [n]. Then γI ⪯ H_k ⪯ ΓI, where
1/(1 + MΛ) ≤ γ and Γ ≤ (1 + √κ)^{2M} (1 + 1/(λ(2√κ + κ))), with κ = Λ/λ.
The corresponding bounds in MNJ depend on the problem dimension:
1/((d + M)Λ) ≤ γ and Γ ≤ [(d + M)Λ]^{d+M−1} / λ^{d+M} ∼ (dκ)^{d+M}.
22 Linear Convergence
Theorem. Suppose that the Assumptions hold. Let w_* be the unique minimizer of f(w). Then in our Algorithm, we have for all k ≥ 0 that
E[f(w_k)] − f(w_*) ≤ ρ^k (E[f(w_0)] − f(w_*)),
where the convergence rate is given by
ρ = (1/(2mη) + ηΓ²Λ(Λ − λ)) / (γλ − ηΓ²Λ²) < 1,
assuming we have chosen η < γλ/(2Γ²Λ²) and that we choose m large enough to satisfy
m ≥ 1 / (2η(γλ − ηΓ²Λ(2Λ − λ))),
which is a positive lower bound given our restriction on η.
23 Numerical Experiments
Empirical Risk Minimization test problems: logistic loss with l2 regularizer
min_w Σ_{i=1}^n log(1 + exp(−y_i ⟨a_i, w⟩)) + L ‖w‖²₂
given data A = [a_1, a_2, ..., a_n] ∈ R^{d×n}, y ∈ {0, 1}^n.
For each method, we chose the step size η ∈ {1, .5, .1, .05, ..., 10^{−8}} that gave the best results.
The full gradient was computed after each full data pass.
Vertical axis in the figures below: log(relative error).
24 [Figure: gisette-scale (d = 5,000, n = 6,000): relative error vs. time (s) for gauss_18_m_5, prev_18_m_5, fact_18_m_3, MNJ_bH_330, SVRG]
25 [Figure: covtype-libsvm-binary (d = 54, n = 581,012): relative error vs. time (s) for gauss_8_m_5, prev_8_m_5, fact_8_m_3, MNJ_bH_3815, SVRG]
26 [Figure: Higgs (d = 28, n = 11,000,000): relative error vs. time (s) for gauss_4_m_5, prev_4_m_5, fact_4_m_3, MNJ_bH_16585, SVRG]
27 [Figure: SUSY (d = 18, n = 3,548,466): relative error vs. time (s) for gauss_5_m_5, prev_5_m_5, fact_5_m_3, MNJ_bH_9420, SVRG]
28 [Figure: epsilon-normalized (d = 2,000, n = 400,000): relative error vs. time (s) for gauss_45_m_5, prev_45_m_5, fact_45_m_3, MNJ_bH_3165, SVRG]
29 [Figure: rcv1-training (d = 47,236, n = 20,242): relative error vs. time (s) for gauss_10_m_5, prev_10_m_5, fact_10_m_3, MNJ_bH_715, SVRG]
30 [Figure: url-combined (d = 3,231,961, n = 2,396,130): relative error vs. time (s) for gauss_2_m_5, prev_2_m_5, fact_2_m_3, MNJ_bH_7740, SVRG]
31 Contributions
- New metric learning framework: a block BFGS framework for gradually learning the metric of the underlying function using a sketched form of the subsampled Hessian matrix.
- New limited memory block BFGS method, which may also be of interest for non-stochastic optimization.
- Several sketching matrix possibilities.
- More reasonable bounds on the eigenvalues of H_k, and more reasonable conditions on the step size.
32 Nonconvex stochastic optimization
Most stochastic quasi-Newton optimization methods are for strongly convex problems; this is needed to ensure a curvature condition required for the positive definiteness of B_k (H_k).
This is not possible for problems min f(x) ≡ E[F(x, ξ)], where f is nonconvex.
In the deterministic setting, one can do a line search to guarantee the curvature condition, and hence the positive definiteness of B_k (H_k).
Line search is not possible for stochastic optimization.
To address these issues we develop a stochastic damped and a stochastic modified L-BFGS method.
33 Stochastic Damped BFGS (Wang, Ma, G, Liu, 2015)
Let y_k = (1/m) Σ_{i=1}^m (∇f(x_{k+1}, ξ_{k,i}) − ∇f(x_k, ξ_{k,i})) and define ȳ_k = θ_k y_k + (1 − θ_k) B_k s_k, where
θ_k = 1, if s_k^T y_k ≥ 0.25 s_k^T B_k s_k,
θ_k = (0.75 s_k^T B_k s_k) / (s_k^T B_k s_k − s_k^T y_k), if s_k^T y_k < 0.25 s_k^T B_k s_k.
Update H_k (replace y_k by ȳ_k):
H_{k+1} = (I − ρ_k s_k ȳ_k^T) H_k (I − ρ_k ȳ_k s_k^T) + ρ_k s_k s_k^T, where ρ_k = 1/(s_k^T ȳ_k).
Implemented in a limited memory version.
Work in progress: combine with variance reduced stochastic gradients (SVRG).
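A minimal sketch of the damping rule and the resulting update above; it is written with an explicit dense B_k for illustration, whereas the limited memory implementation never forms B_k:

```python
import numpy as np

def damped_y(B, s, y):
    # ybar_k = theta_k y_k + (1 - theta_k) B_k s_k, guaranteeing s^T ybar > 0
    sBs = s @ B @ s
    sy = s @ y
    theta = 1.0 if sy >= 0.25 * sBs else 0.75 * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * (B @ s)

def damped_bfgs_update_inverse(H, B, s, y):
    # H_{k+1} = (I - rho s ybar^T) H (I - rho ybar s^T) + rho s s^T, rho = 1/(s^T ybar)
    ybar = damped_y(B, s, y)
    rho = 1.0 / (s @ ybar)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, ybar)
    return V @ H @ V.T + rho * np.outer(s, s)
```

With the damped θ_k, one can check that s_k^T ȳ_k ≥ 0.25 s_k^T B_k s_k > 0, so the update stays well defined even when f is nonconvex.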
34 Convergence of Stochastic Damped BFGS Method
Assumptions:
[AS1] f ∈ C¹, bounded below, ∇f is L-Lipschitz continuous.
[AS2] For any iteration k, the stochastic gradient satisfies
E_{ξ_k}[∇f(x_k, ξ_k)] = ∇f(x_k) and E_{ξ_k}[‖∇f(x_k, ξ_k) − ∇f(x_k)‖²] ≤ σ².
Theorem (Global convergence). Assume AS1-AS2 hold (and α_k = β/k ≤ γ/(LΓ²) for all k); then there exist positive constants γ, Γ such that γI ⪯ H_k ⪯ ΓI for all k, and
lim inf_{k→∞} ‖∇f(x_k)‖ = 0, with probability 1.
Under the additional assumption E_{ξ_k}[‖∇f(x_k, ξ_k)‖²] ≤ M,
lim_{k→∞} ‖∇f(x_k)‖ = 0, with probability 1.
We do not need to assume convexity of f.
35 Block L-BFGS Method for Non-Convex Stochastic Optimization
Block update:
H_{k+1} = (I − S_k Λ_k^{-1} Y_k^T) H_k (I − Y_k Λ_k^{-1} S_k^T) + S_k Λ_k^{-1} S_k^T,
where Λ_k = S_k^T Y_k = S_k^T ∇²f(x_k) S_k.
In the non-convex case Λ_k = Λ_k^T may not be positive definite.
Indefiniteness of Λ_k is discovered while computing its Cholesky (LDL^T) factorization.
If during the factorization d_j ≥ δ or |(LD^{1/2})_{ij}| ≤ β is not satisfied, d_j is increased by τ_j, i.e., (Λ_k)_{jj} ← (Λ_k)_{jj} + τ_j.
This has the effect of moving the search direction −H_{k+1} ∇f(x_{k+1}) toward one of negative curvature.
A modification based on Gershgorin discs is also possible.
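A rough sketch of the pivot-modification idea above: an LDL^T factorization of Λ_k that increases any pivot d_j falling below a threshold δ. The bound |(LD^{1/2})_{ij}| ≤ β on the off-diagonal growth is omitted here for brevity, so this is an illustration of the mechanism rather than the full modified factorization:

```python
import numpy as np

def modified_ldl(Lam, delta=1e-8):
    # LDL^T factorization of Lam with pivots bumped up to at least delta;
    # returns L, the pivot vector d, and the (possibly modified) matrix Lam_mod.
    p = Lam.shape[0]
    L, d = np.eye(p), np.zeros(p)
    Lam_mod = Lam.copy().astype(float)
    for j in range(p):
        dj = Lam_mod[j, j] - L[j, :j] @ (d[:j] * L[j, :j])
        if dj < delta:                       # pivot too small or negative:
            tau = delta - dj                 # increase (Lambda_k)_{jj} by tau_j
            Lam_mod[j, j] += tau
            dj = delta
        d[j] = dj
        for i in range(j + 1, p):
            L[i, j] = (Lam_mod[i, j] - L[i, :j] @ (d[:j] * L[j, :j])) / dj
    return L, d, Lam_mod
```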