Gradient Methods Using Momentum and Memory


Chapter 3

Gradient Methods Using Momentum and Memory

The steepest descent method described earlier always steps in the negative gradient direction, which is orthogonal to the boundary of the level set of f at the current iterate. This direction can change sharply from one iteration to the next. For example, when the contours of f are narrow and elongated, this strategy can produce wild swings in direction and small steps, with only slow progress toward a solution.

The steepest descent method is greedy, in a sense: it uses the direction that is most productive at this particular iterate, making no explicit use of knowledge gained about the function f at earlier iterations. In this chapter, we examine methods that encode knowledge of the function in various ways and exploit this knowledge in their choice of search directions and step lengths. One such class of techniques makes use of momentum, in which the search direction tends to be similar to the one used on the previous step, with a small tweak in the direction of a negative gradient evaluated at the current point or a nearby point. Each search direction is thus a combination of all gradients encountered so far during the search: a compact encoding of the history of the search. Momentum methods in common use include the heavy-ball method, the conjugate gradient method, and Nesterov's accelerated gradient methods. We will also consider model-based methods, which construct an explicit model of the function f from information gathered at previous iterations and use this model to determine the search direction.

3.1 Motivation from Differential Equations

An intuition for momentum methods comes from viewing an optimization algorithm as a dynamical system. In the continuous limit, an algorithm often traces out the solution path of a differential equation. For instance, the gradient method is akin to moving down a potential well defined by f, with velocity equal to the negative gradient:

    dx/dt = -∇f(x).
This differential equation has fixed points precisely at the points where ∇f(x) = 0, which are the minimizers of a smooth convex function f.
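Discretizing dx/dt = -∇f(x) with the forward Euler method recovers the gradient method itself: x(t + Δt) = x(t) - Δt ∇f(x(t)). A minimal sketch; the quadratic objective and the step length below are illustrative assumptions, not from the text:

```python
import numpy as np

# Forward-Euler discretization of dx/dt = -grad f(x):
#   x(t + dt) = x(t) - dt * grad f(x(t)),
# i.e. gradient descent with step length dt.
# Assumed example: f(x) = 0.5 x^T Q x with Q = diag(1, 10).
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x

x = np.array([1.0, 1.0])
dt = 0.05                # time step = step length; must be < 2/L = 0.2 for stability here
for _ in range(500):
    x = x - dt * grad(x)

print(np.linalg.norm(x))  # approaches the fixed point / minimizer x* = 0
```

Smaller time steps follow the continuous trajectory more closely but require more iterations to reach the fixed point.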

There are other differential equations whose fixed points occur precisely at points for which ∇f(x) = 0. Consider the second-order differential equation that governs a particle with mass moving in a potential defined by f:

    µ d²x/dt² = -∇f(x) - b dx/dt.    (3.1)

As before, points x for which ∇f(x) = 0 are fixed points of this ODE. In this setting, µ governs the mass of the particle and b governs the friction that dissipates energy during the evolution of the system. If b = 0, no energy is dissipated and the system may never settle at a fixed point. Trajectories tend to continue to move in the direction they were moving before, and heavier objects move downhill faster than light objects in the presence of friction. Note that in the limit as the mass µ goes to zero, we recover the ODE that we had derived for the gradient method.

If we approximate this ODE using simple finite differences, we have

    µ (x(t + Δt) - 2x(t) + x(t - Δt)) / Δt²  =  -∇f(x(t)) - b (x(t + Δt) - x(t)) / Δt.

Rearranging the terms in this expression and redefining parameters gives the finite-difference equation

    x(t + Δt) = x(t) - α ∇f(x(t)) + β (x(t) - x(t - Δt)).    (3.2)

The algorithm defined by (3.2) is exactly the heavy-ball method of Polyak, for certain choices of the parameters α and β. With a minor modification, it becomes Nesterov's optimal method. When f is a convex quadratic, it is known also as Chebyshev's iterative method.

By rewriting (3.2) in terms of discrete iterates, we obtain

    x_{k+1} = x_k - α ∇f(x_k) + β (x_k - x_{k-1}).    (3.3)

Defining

    p_k = x_{k+1} - x_k = -α ∇f(x_k) + β (x_k - x_{k-1}) = -α ∇f(x_k) + β p_{k-1},

we can rewrite the iteration in terms of two sequences:

    x_{k+1} = x_k + p_k,
    p_k = -α ∇f(x_k) + β p_{k-1}.

Nesterov's optimal method (also known as Nesterov's accelerated method) is defined by the formula

    x_{k+1} = x_k - α ∇f(x_k + β (x_k - x_{k-1})) + β (x_k - x_{k-1}).    (3.4)

This method differs only in the way that the underlying ODE (3.1) is discretized. In the two-state version, Nesterov's method becomes

    x_{k+1} = x_k + p_k,
    p_k = -α ∇f(x_k + β p_{k-1}) + β p_{k-1}.

By a change of variable, Nesterov's optimal method can also be
written in the following equivalent form:

    y_k = x_k + β_k (x_k - x_{k-1}),
    x_{k+1} = y_k - α_k ∇f(y_k).
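The two-state forms above translate directly into code. A sketch on an assumed small diagonal quadratic: the heavy-ball parameters follow the choices derived in the next section, while the Nesterov step length α = 1/L and momentum (√κ - 1)/(√κ + 1) are the standard strongly convex choices, assumed here rather than taken from the text:

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, iters):
    # x_{k+1} = x_k + p_k,  p_k = -alpha * grad f(x_k) + beta * p_{k-1}
    x, p = x0.astype(float).copy(), np.zeros_like(x0, dtype=float)
    for _ in range(iters):
        p = -alpha * grad(x) + beta * p
        x = x + p
    return x

def nesterov(grad, x0, alpha, beta, iters):
    # x_{k+1} = x_k + p_k,  p_k = -alpha * grad f(x_k + beta * p_{k-1}) + beta * p_{k-1}
    x, p = x0.astype(float).copy(), np.zeros_like(x0, dtype=float)
    for _ in range(iters):
        p = -alpha * grad(x + beta * p) + beta * p   # gradient at the extrapolated point
        x = x + p
    return x

# Assumed test problem: f(x) = 0.5 x^T Q x with Q = diag(1, 100), so m = 1, L = 100.
Q = np.diag([1.0, 100.0])
grad = lambda x: Q @ x
x0 = np.ones(2)
L, m = 100.0, 1.0

alpha_hb = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2                    # heavy-ball step
beta_hb = (np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))    # heavy-ball momentum
x_hb = heavy_ball(grad, x0, alpha_hb, beta_hb, 200)

kappa = L / m
alpha_n = 1.0 / L                                                  # assumed Nesterov step
beta_n = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)           # assumed Nesterov momentum
x_nest = nesterov(grad, x0, alpha_n, beta_n, 200)

print(np.linalg.norm(x_hb), np.linalg.norm(x_nest))  # both near the minimizer x* = 0
```

The only difference between the two routines is where the gradient is evaluated: at the current iterate for heavy ball, at the extrapolated point x_k + β p_{k-1} for Nesterov.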

3.2 Heavy-Ball Method

In this section, we analyze the convergence behavior of the heavy-ball method (3.3) and derive suitable values for its parameters α and β. Our development is in terms of an inhomogeneous convex quadratic objective

    f(x) = (1/2) x^T Q x - b^T x + c,    (3.5)

whose Hessian Q is positive definite with eigenvalues

    0 < m = λ_n ≤ λ_{n-1} ≤ … ≤ λ_2 ≤ λ_1 = L.

Note that Qx* = b at the minimizer x* of f.

We'll show a powerful linear convergence result for the heavy-ball method on the quadratic function (3.5). But we need to be careful! This result does not imply similar convergence behavior on all strongly convex functions. A simple one-dimensional example from the Exercises serves to demonstrate the distinction.

By substituting ∇f(x_k) = Qx_k - b = Q(x_k - x*) in (3.3), and sprinkling x* liberally throughout this expression, we obtain

    x_{k+1} - x* = (x_k - x*) - αQ(x_k - x*) + β((x_k - x*) - (x_{k-1} - x*)).    (3.6)

By concatenating the error vector x_k - x* over two successive steps, we obtain

    [ x_{k+1} - x* ]     [ (1+β)I - αQ   -βI ] [ x_k - x*     ]
    [ x_k - x*     ]  =  [ I              0  ] [ x_{k-1} - x* ].    (3.7)

By defining

    w_k := [ x_{k+1} - x* ]          T := [ (1+β)I - αQ   -βI ]
           [ x_k - x*     ],              [ I              0  ],    (3.8)

we can write the iteration (3.7) as

    w_k = T w_{k-1},    k = 1, 2, ….    (3.9)

The properties of T are key to analyzing the convergence of the sequence {w_k} to zero. The following theorem shows how the eigenvalues of T depend on α, β, and the eigenvalues λ_i of Q.

Theorem 3.1. If we choose β such that

    1 > β > max{ (1 - √(αm))², (1 - √(αL))² },    (3.10)

then the matrix T has all complex eigenvalues, which are as follows:

    λ_{i,1} = (1/2) [ (1 + β - αλ_i) + i √(4β - (1 + β - αλ_i)²) ],
    λ_{i,2} = (1/2) [ (1 + β - αλ_i) - i √(4β - (1 + β - αλ_i)²) ],    i = 1, 2, …, n.

We have λ_{i,1} ≠ λ_{i,2} for all i, and all eigenvalues have magnitude √β.
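The conclusion of Theorem 3.1 is easy to check numerically. A sketch, with an assumed diagonal Q and parameter values that satisfy the hypothesis (3.10):

```python
import numpy as np

# Assumed eigenvalues of Q: m = 1, L = 25.
lams = np.array([1.0, 3.0, 7.0, 25.0])
n = lams.size
Q = np.diag(lams)
L, m = lams.max(), lams.min()

# Parameter values chosen so that 1 > beta > max((1 - sqrt(alpha*m))^2, (1 - sqrt(alpha*L))^2):
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2     # = 1/9
beta = (np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))  # = 2/3 > 4/9 = lower bound

# Build the 2n x 2n matrix T from (3.8) and inspect its spectrum.
I, Z = np.eye(n), np.zeros((n, n))
T = np.block([[(1 + beta) * I - alpha * Q, -beta * I],
              [I, Z]])

mods = np.abs(np.linalg.eigvals(T))
print(np.allclose(mods, np.sqrt(beta)))   # True: every eigenvalue has magnitude sqrt(beta)
```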

Proof. We write the eigenvalue decomposition of Q as Q = UΛU^T, where Λ = diag(λ_1, λ_2, …, λ_n). We define the permutation matrix Π by

    Π_ij = 1 if i is odd and j = (i+1)/2,
    Π_ij = 1 if i is even and j = n + i/2,
    Π_ij = 0 otherwise.

Applying a similarity transformation to the matrix T, we have

    [ U  0 ]^T     [ U  0 ]      [ (1+β)I - αΛ   -βI ]
    [ 0  U ]     T [ 0  U ]   =  [ I              0  ],

and conjugating with Π then yields the block-diagonal matrix

    Π [ (1+β)I - αΛ   -βI ] Π^T  =  diag(T_1, T_2, …, T_n),
      [ I              0  ]

where

    T_i = [ 1 + β - αλ_i   -β ]
          [ 1               0 ],    i = 1, 2, …, n.

The eigenvalues of T are the eigenvalues of the T_i, for i = 1, 2, …, n. The eigenvalues of T_i are the roots of the quadratic

    u² - (1 + β - αλ_i) u + β = 0,

which are given by the familiar formula

    u = (1/2) [ (1 + β - αλ_i) ± √((1 + β - αλ_i)² - 4β) ].

The two roots are distinct complex numbers when (1 + β - αλ_i)² - 4β < 0, which happens when

    β ∈ ( (1 - √(αλ_i))², (1 + √(αλ_i))² ).    (3.11)

When β is in the range (3.10), it satisfies (3.11) for all i. Hence, the eigenvalues of each T_i are two distinct complex numbers, for all i, and have the values λ_{i,1}, λ_{i,2} defined above. It is easy to check that all λ_{i,1}, λ_{i,2}, i = 1, 2, …, n, have magnitude √β.

Because λ_{i,1} ≠ λ_{i,2} for the chosen value of β, there exists a (complex) eigenvalue decomposition of each T_i, of the form

    U_i^{-1} T_i U_i = S_i,    S_i = [ λ_{i,1}   0       ]
                                     [ 0         λ_{i,2} ].

In fact, by composing the U_i, i = 1, 2, …, n, with the orthogonal matrices U and Π defined in the proof above, we can define a (2n) × (2n) nonsingular matrix V such that

    V^{-1} T V = S,    where S := diag(S_1, S_2, …, S_n).
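The permutation step in the proof can also be checked numerically. The sketch below takes U = I (that is, Q already diagonal, an assumed simplification) and verifies that Π T Π^T is block diagonal with the 2×2 blocks T_i:

```python
import numpy as np

n = 3
lams = np.array([1.0, 2.0, 9.0])   # assumed eigenvalues of Q
alpha, beta = 0.25, 0.6            # arbitrary parameters; the block structure holds for any choice
Lam = np.diag(lams)
I, Z = np.eye(n), np.zeros((n, n))
T = np.block([[(1 + beta) * I - alpha * Lam, -beta * I],
              [I, Z]])

# Permutation matrix Pi as defined in the proof (1-indexed rule).
Pi = np.zeros((2 * n, 2 * n))
for i in range(1, 2 * n + 1):
    j = (i + 1) // 2 if i % 2 == 1 else n + i // 2
    Pi[i - 1, j - 1] = 1.0

M = Pi @ T @ Pi.T

# Expected block-diagonal matrix diag(T_1, ..., T_n).
expected = np.zeros((2 * n, 2 * n))
for i, lam in enumerate(lams):
    expected[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[1 + beta - alpha * lam, -beta],
                                                  [1.0, 0.0]]

print(np.allclose(M, expected))    # True
```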

Therefore, by defining

    z_k := V^{-1} w_k,    (3.12)

we can express the relation (3.9) as

    z_k = S z_{k-1} = S² z_{k-2} = … = S^k z_0.    (3.13)

Since S is a diagonal matrix whose diagonal elements all have magnitude √β, we have that

    ‖z_k‖ = β^{k/2} ‖z_0‖,    (3.14)

where ‖z‖² = z^H z. When we choose

    α := 4 / (√L + √m)²,    (3.15)

we find that

    (1 - √(αm))² = (1 - √(αL))² = ( (√L - √m) / (√L + √m) )²,    (3.16)

so a choice of β that satisfies (3.10) is

    β := (√L - √m) / (√L + √m).    (3.17)

(Note that this value of β lies strictly between the lower bound (3.16) and the upper bound of 1, as required by (3.10).) When we use κ = L/m to denote the condition number of Q, we have

    β = 1 - 2 / (√κ + 1).    (3.18)

We conclude with the following result concerning R-linear convergence of x_k to x*.

Theorem 3.2. For the choices (3.15) of α and (3.17) of β, the heavy-ball iteration (3.3) produces a sequence {x_k} that converges R-linearly to x* with factor √β = (1 - 2/(√κ + 1))^{1/2}.

Proof. We have from (3.8), (3.12), and (3.14) that

    ‖x_k - x*‖ ≤ ‖w_k‖ = ‖V z_k‖ ≤ ‖V‖ ‖z_k‖ = β^{k/2} ‖V‖ ‖z_0‖,

proving the result.

We have monotonic decrease in the norm of the transformed vector z_k (from (3.13)), but not of the original error vector w_k. In fact, experiments can show sharp increases in ‖w_k‖ on early iterations, before the R-linear convergence promised by Theorem 3.2 takes effect.

Let us compare the linear convergence of heavy ball against steepest descent on convex quadratics. Recall from (1.8) that the steepest-descent method with constant step α = 1/L requires O((L/m) log(1/ε)) iterations to obtain a reduction of factor ε in the function error f(x_k) - f*. The rate defined by β in Theorem 3.2 suggests a complexity of O(√(L/m) log(1/ε)) iterations to obtain a reduction of factor ε in ‖w_k‖ (a different quantity). For problems in which the condition number κ = L/m is moderate to large, the heavy-ball method has a significant advantage. For example, if κ = 1000, the improved rate translates into roughly a factor-of-30 reduction in the number of iterations required, with virtually identical workload per iteration (one gradient evaluation and a few vector operations).
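A small experiment illustrates this comparison. The diagonal test quadratic with κ = 1000, the starting point, and the tolerance below are assumptions chosen for illustration:

```python
import numpy as np

# Assumed ill-conditioned diagonal quadratic: m = 1, L = 1000, kappa = 1000.
lams = np.linspace(1.0, 1000.0, 50)
Q = np.diag(lams)
grad = lambda x: Q @ x
x0 = np.ones(50)
L, m = 1000.0, 1.0
tol = 1e-6

def iters_to_tol(step_fn, x0, tol, max_iters=200000):
    # Count iterations until ||x_k|| <= tol * ||x0|| (the minimizer is x* = 0).
    x, x_prev = x0.copy(), x0.copy()
    for k in range(max_iters):
        if np.linalg.norm(x) <= tol * np.linalg.norm(x0):
            return k
        x, x_prev = step_fn(x, x_prev), x
    return max_iters

# Steepest descent with constant step 1/L.
sd_step = lambda x, x_prev: x - (1.0 / L) * grad(x)

# Heavy ball with the parameter choices (3.15) and (3.17).
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = (np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))
hb_step = lambda x, x_prev: x - alpha * grad(x) + beta * (x - x_prev)

k_sd = iters_to_tol(sd_step, x0, tol)
k_hb = iters_to_tol(hb_step, x0, tol)
print(k_sd, k_hb)   # heavy ball reaches the tolerance in far fewer iterations
```

Per-iteration cost is essentially identical for the two methods (one gradient evaluation plus a few vector operations), so the iteration counts translate directly into wall-clock savings.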