Chapter 3

Gradient Methods Using Momentum and Memory

The steepest descent method described in the previous chapter always steps in the negative gradient direction, which is orthogonal to the boundary of the level set of $f$ at the current iterate. This direction can change sharply from one iteration to the next. For example, when the contours of $f$ are narrow and elongated, this strategy can produce wild swings in direction and small steps, with only slow progress toward a solution.

The steepest descent method is greedy, in a sense: it uses the direction that is most productive at the current iterate, making no explicit use of knowledge gained about the function $f$ at earlier iterations. In this chapter, we examine methods that encode knowledge of the function in various ways, and exploit this knowledge in their choice of search directions and step lengths.

One such class of techniques makes use of momentum, in which the search direction tends to be similar to the one used on the previous step, with a small tweak in the direction of the negative gradient evaluated at the current point or a nearby point. Each search direction is thus a combination of all gradients encountered so far during the search: a compact encoding of the history of the search. Momentum methods in common use include the heavy-ball method, the conjugate gradient method, and Nesterov's accelerated gradient methods. We will also consider model-based methods, which construct an explicit model of the function $f$ from information gathered at previous iterations, and use this model to determine the search direction.

3.1 Motivation from Differential Equations

An intuition for momentum methods comes from viewing an optimization algorithm as a dynamical system. In the continuous limit, an algorithm often traces out the solution path of a differential equation. For instance, the gradient method is akin to moving down a potential well, with velocity given by the negative gradient of $f$:
\[
\frac{dx}{dt} = -\nabla f(x).
\]
This differential equation has fixed points precisely at the points where $\nabla f(x) = 0$, which are the minimizers when $f$ is smooth and convex.
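This correspondence can be sketched numerically. In the following minimal example, the objective $f(x) = \frac{1}{2}x^2$, the step length, and the iteration count are illustrative choices of ours, not taken from the text; it applies Euler's method to the gradient-flow ODE and observes that each Euler step is exactly a gradient-descent step, with the iterates approaching the fixed point where the gradient vanishes.

```python
# Forward-Euler discretization of the gradient-flow ODE  dx/dt = -grad f(x).
# Each Euler step with time step dt is exactly one gradient-descent step
# with step length dt.  The objective f(x) = 0.5 * x**2 is an illustrative
# choice (not from the text); its gradient is x and its minimizer is x* = 0.

def grad_f(x):
    return x  # gradient of f(x) = 0.5 * x**2

def gradient_flow_euler(x0, dt=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - dt * grad_f(x)  # Euler step == gradient-descent step
    return x

x_final = gradient_flow_euler(5.0)
print(abs(x_final))  # small: the iterates approach the fixed point x* = 0
```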
There are other differential equations whose fixed points occur precisely at points for which $\nabla f(x) = 0$. Consider the second-order differential equation that governs a particle with mass moving in a potential defined by the gradient of $f$:
\[
\mu \frac{d^2 x}{dt^2} = -\nabla f(x) - b \frac{dx}{dt}.
\tag{3.1}
\]
As before, points $x$ for which $\nabla f(x) = 0$ are fixed points of this ODE. In this setting, $\mu$ governs the mass of the particle and $b$ governs the friction that dissipates energy during the evolution of the system. If $b = 0$, no energy is dissipated, and the system may oscillate indefinitely without settling at a fixed point. Trajectories tend to continue to move in the direction they were moving before, and heavier objects move downhill faster than light objects in the presence of friction. Note that in the limit as the mass $\mu$ goes to zero, we recover the ODE derived above for the gradient method.

If we approximate this ODE using simple finite differences, we have
\[
\mu \, \frac{x(t+\Delta t) - 2x(t) + x(t-\Delta t)}{(\Delta t)^2} = -\nabla f(x(t)) - b \, \frac{x(t+\Delta t) - x(t)}{\Delta t}.
\]
Rearranging the terms in this expression and redefining parameters gives the finite-difference equation
\[
x(t+\Delta t) = x(t) - \alpha \nabla f(x(t)) + \beta \bigl( x(t) - x(t-\Delta t) \bigr).
\tag{3.2}
\]
The algorithm defined by (3.2) is exactly the heavy-ball method of Polyak, for certain choices of the parameters $\alpha$ and $\beta$. With a minor modification, it becomes Nesterov's optimal method. When $f$ is a convex quadratic, it is known also as Chebyshev's iterative method.

Rewriting (3.2) in terms of discrete iterates, we obtain
\[
x^{k+1} = x^k - \alpha \nabla f(x^k) + \beta (x^k - x^{k-1}).
\tag{3.3}
\]
Defining
\[
p^k = x^{k+1} - x^k = -\alpha \nabla f(x^k) + \beta (x^k - x^{k-1}) = -\alpha \nabla f(x^k) + \beta p^{k-1},
\]
we can rewrite the iteration in terms of two sequences:
\[
x^{k+1} = x^k + p^k, \qquad p^k = -\alpha \nabla f(x^k) + \beta p^{k-1}.
\]
Nesterov's optimal method (also known as Nesterov's accelerated gradient method) is defined by the formula
\[
x^{k+1} = x^k - \alpha \nabla f\bigl( x^k + \beta (x^k - x^{k-1}) \bigr) + \beta (x^k - x^{k-1}).
\tag{3.4}
\]
This method differs from (3.3) only in the way that the underlying ODE (3.1) is discretized: the gradient is evaluated at an extrapolated point rather than at $x^k$. In the two-sequence version, Nesterov's method becomes
\[
x^{k+1} = x^k + p^k, \qquad p^k = -\alpha \nabla f(x^k + \beta p^{k-1}) + \beta p^{k-1}.
\]
By a change of variable, Nesterov's optimal method can also be
written in the following equivalent form:
\[
y^k = x^k + \beta_k (x^k - x^{k-1}), \qquad x^{k+1} = y^k - \alpha_k \nabla f(y^k).
\]
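To make the preceding formulas concrete, here is a sketch of the heavy-ball iteration (3.3) and the Nesterov iteration (3.4) on a small two-dimensional quadratic. The test function and all parameter values are illustrative choices of ours, not prescriptions from the text; in particular, Nesterov's method uses the short step $\alpha = 1/L$, a standard choice, because in this example the longer heavy-ball step causes iteration (3.4) to diverge.

```python
# Heavy-ball iteration (3.3) and Nesterov iteration (3.4) on a toy 2-D
# quadratic f(x) = 0.5 * (L * x[0]**2 + m * x[1]**2), minimized at the
# origin.  The values of m and L are illustrative, not from the text.

m, L = 1.0, 100.0

def grad(x):
    return [L * x[0], m * x[1]]

def heavy_ball(x0, alpha, beta, iters):
    x_prev, x = x0[:], x0[:]
    for _ in range(iters):
        g = grad(x)  # gradient at the current point
        x, x_prev = [x[i] - alpha * g[i] + beta * (x[i] - x_prev[i])
                     for i in range(2)], x
    return x

def nesterov(x0, alpha, beta, iters):
    x_prev, x = x0[:], x0[:]
    for _ in range(iters):
        y = [x[i] + beta * (x[i] - x_prev[i]) for i in range(2)]
        g = grad(y)  # gradient at the extrapolated point: the only change
        x, x_prev = [x[i] - alpha * g[i] + beta * (x[i] - x_prev[i])
                     for i in range(2)], x
    return x

sqL, sqm = L ** 0.5, m ** 0.5
beta = (sqL - sqm) / (sqL + sqm)   # momentum parameter
alpha_hb = 4.0 / (sqL + sqm) ** 2  # heavy-ball step length
alpha_nes = 1.0 / L                # standard short step for Nesterov

x_hb = heavy_ball([1.0, 1.0], alpha_hb, beta, 200)
x_nes = nesterov([1.0, 1.0], alpha_nes, beta, 200)
print(max(abs(v) for v in x_hb), max(abs(v) for v in x_nes))  # both tiny
```

Both sequences drive the error to roughly machine-noise levels in 200 iterations, far faster than plain gradient descent would on this poorly conditioned problem.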
3.2 Heavy-Ball Method

In this section, we analyze the convergence behavior of the heavy-ball method (3.3) and derive suitable values for its parameters $\alpha$ and $\beta$. Our development is done in terms of an inhomogeneous convex quadratic objective
\[
f(x) = \frac{1}{2} x^T Q x - b^T x + c,
\tag{3.5}
\]
with positive definite Hessian $Q$ whose eigenvalues satisfy
\[
0 < m = \lambda_n \le \lambda_{n-1} \le \cdots \le \lambda_2 \le \lambda_1 = L.
\]
Note that $Q x^* = b$ at the minimizer $x^*$ of $f$.

We will show a powerful linear convergence result for the heavy-ball method on the quadratic function (3.5). But we need to be careful! This result does not imply similar convergence behavior on all strongly convex functions. A simple one-dimensional example from the Exercises serves to demonstrate the distinction.

By substituting $\nabla f(x^k) = Q x^k - b = Q(x^k - x^*)$ in (3.3), and sprinkling $x^*$ liberally throughout this expression, we obtain
\[
x^{k+1} - x^* = (x^k - x^*) - \alpha Q (x^k - x^*) + \beta \bigl( (x^k - x^*) - (x^{k-1} - x^*) \bigr).
\tag{3.6}
\]
By concatenating the error vector $x^k - x^*$ over two successive steps, we obtain
\[
\begin{bmatrix} x^{k+1} - x^* \\ x^k - x^* \end{bmatrix}
=
\begin{bmatrix} (1+\beta) I - \alpha Q & -\beta I \\ I & 0 \end{bmatrix}
\begin{bmatrix} x^k - x^* \\ x^{k-1} - x^* \end{bmatrix}.
\tag{3.7}
\]
By defining
\[
w^k := \begin{bmatrix} x^{k+1} - x^* \\ x^k - x^* \end{bmatrix},
\qquad
T := \begin{bmatrix} (1+\beta) I - \alpha Q & -\beta I \\ I & 0 \end{bmatrix},
\tag{3.8}
\]
we can write the iteration (3.7) as
\[
w^k = T w^{k-1}, \quad k = 1, 2, \ldots.
\tag{3.9}
\]
The properties of $T$ are key to analyzing the convergence of the sequence $\{w^k\}$ to zero. The following theorem shows how the eigenvalues of $T$ depend on $\alpha$, $\beta$, and the eigenvalues $\lambda_i$ of $Q$.

Theorem 3.1. If we choose $\beta$ such that
\[
1 > \beta > \max\bigl( (1 - \sqrt{\alpha m})^2, \, (1 - \sqrt{\alpha L})^2 \bigr),
\tag{3.10}
\]
then the matrix $T$ has all complex eigenvalues, which are as follows:
\[
\lambda_{i,1} = \frac{1}{2} \Bigl[ (1 + \beta - \alpha \lambda_i) + \mathrm{i} \sqrt{4\beta - (1 + \beta - \alpha \lambda_i)^2} \Bigr],
\qquad
\lambda_{i,2} = \frac{1}{2} \Bigl[ (1 + \beta - \alpha \lambda_i) - \mathrm{i} \sqrt{4\beta - (1 + \beta - \alpha \lambda_i)^2} \Bigr],
\]
for $i = 1, 2, \ldots, n$. We have $\lambda_{i,1} \ne \lambda_{i,2}$ for all $i$, and all eigenvalues have magnitude $\sqrt{\beta}$.
Proof. We write the eigenvalue decomposition of $Q$ as $Q = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Defining the permutation matrix $\Pi$ by
\[
\Pi_{ij} = \begin{cases} 1 & i \text{ odd},\ j = (i+1)/2, \\ 1 & i \text{ even},\ j = n + i/2, \\ 0 & \text{otherwise}, \end{cases}
\]
we have, by applying a similarity transformation to the matrix $T$, that
\[
\Pi \begin{bmatrix} U & 0 \\ 0 & U \end{bmatrix}^T
\begin{bmatrix} (1+\beta) I - \alpha Q & -\beta I \\ I & 0 \end{bmatrix}
\begin{bmatrix} U & 0 \\ 0 & U \end{bmatrix} \Pi^T
=
\Pi \begin{bmatrix} (1+\beta) I - \alpha \Lambda & -\beta I \\ I & 0 \end{bmatrix} \Pi^T
=
\begin{bmatrix} T_1 & & & \\ & T_2 & & \\ & & \ddots & \\ & & & T_n \end{bmatrix},
\]
where
\[
T_i = \begin{bmatrix} 1 + \beta - \alpha \lambda_i & -\beta \\ 1 & 0 \end{bmatrix}, \quad i = 1, 2, \ldots, n.
\]
The eigenvalues of $T$ are the eigenvalues of the $T_i$, for $i = 1, 2, \ldots, n$. The eigenvalues of $T_i$ are the roots of the following quadratic:
\[
u^2 - (1 + \beta - \alpha \lambda_i) u + \beta = 0,
\]
which are given by the familiar formula
\[
u = \frac{1}{2} \Bigl[ (1 + \beta - \alpha \lambda_i) \pm \sqrt{(1 + \beta - \alpha \lambda_i)^2 - 4\beta} \Bigr].
\]
The two roots are distinct complex numbers when $(1 + \beta - \alpha \lambda_i)^2 - 4\beta < 0$, which happens when
\[
\beta \in \bigl( (1 - \sqrt{\alpha \lambda_i})^2, \, (1 + \sqrt{\alpha \lambda_i})^2 \bigr).
\tag{3.11}
\]
When $\beta$ is in the range (3.10), it satisfies (3.11) for all $i$. Hence, the eigenvalues of each $T_i$ are two distinct complex numbers, for all $i$, and have the values $\lambda_{i,1}$, $\lambda_{i,2}$ defined above. It is easy to check that all $\lambda_{i,1}$, $\lambda_{i,2}$, $i = 1, 2, \ldots, n$, have magnitude $\sqrt{\beta}$. $\square$

Because $\lambda_{i,1} \ne \lambda_{i,2}$ for the chosen value of $\beta$, there exists a (complex) eigenvalue decomposition of each $T_i$, of the form
\[
U_i^{-1} T_i U_i = S_i, \qquad S_i = \begin{bmatrix} \lambda_{i,1} & 0 \\ 0 & \lambda_{i,2} \end{bmatrix}.
\]
In fact, by composing the $U_i$, $i = 1, 2, \ldots, n$, with the orthogonal matrices $U$ and $\Pi$ defined in the proof above, we can define a $(2n) \times (2n)$ nonsingular matrix $V$ such that
\[
V^{-1} T V = S, \quad \text{where } S := \begin{bmatrix} S_1 & & & \\ & S_2 & & \\ & & \ddots & \\ & & & S_n \end{bmatrix}.
\]
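The conclusion of the proof can be checked numerically: for each eigenvalue $\lambda_i$ of $Q$, the roots of the quadratic from the proof form a complex-conjugate pair of modulus $\sqrt{\beta}$. In the sketch below, the spectrum of $Q$ and the values of $m$ and $L$ are illustrative choices of ours.

```python
import cmath

# Numerical check of the 2x2 blocks T_i from the proof: for beta in the
# admissible range, the roots of  u**2 - (1 + beta - alpha*lam)*u + beta = 0
# form a complex-conjugate pair of modulus sqrt(beta), for every
# eigenvalue lam of Q.  The spectrum below is an illustrative choice.

m, L = 1.0, 100.0
eigs = [m, 5.0, 37.0, L]                              # made-up spectrum of Q
alpha = 4.0 / (L ** 0.5 + m ** 0.5) ** 2              # step length
beta = (L ** 0.5 - m ** 0.5) / (L ** 0.5 + m ** 0.5)  # momentum parameter

moduli = []
for lam in eigs:
    trace = 1.0 + beta - alpha * lam
    disc = cmath.sqrt(trace * trace - 4.0 * beta)  # purely imaginary here
    u1 = (trace + disc) / 2.0                      # the two roots of the
    u2 = (trace - disc) / 2.0                      # characteristic quadratic
    moduli.append((abs(u1), abs(u2)))

print(beta ** 0.5, moduli)  # every modulus equals sqrt(beta)
```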
Therefore, by defining
\[
z^k := V^{-1} w^k,
\tag{3.12}
\]
we can express the relation (3.9) as
\[
z^k = S z^{k-1} = S^2 z^{k-2} = \cdots = S^k z^0.
\tag{3.13}
\]
Since $S$ is a diagonal matrix whose diagonal elements all have magnitude $\sqrt{\beta}$, we have that
\[
\| z^k \| = \beta^{k/2} \| z^0 \|,
\tag{3.14}
\]
where $\| z \| := (z^H z)^{1/2}$.

When we choose
\[
\alpha := \frac{4}{(\sqrt{L} + \sqrt{m})^2},
\tag{3.15}
\]
we find that
\[
(1 - \sqrt{\alpha m})^2 = (1 - \sqrt{\alpha L})^2 = \left( \frac{\sqrt{L} - \sqrt{m}}{\sqrt{L} + \sqrt{m}} \right)^2,
\tag{3.16}
\]
so a choice of $\beta$ that satisfies (3.10) would be
\[
\beta := \frac{\sqrt{L} - \sqrt{m}}{\sqrt{L} + \sqrt{m}}.
\tag{3.17}
\]
(Note that this value of $\beta$ lies strictly between the lower bound (3.16) and the upper bound of 1, as required by (3.10).) When we use $\kappa = L/m$ to denote the condition number of $Q$, we have
\[
\beta = 1 - \frac{2}{\sqrt{\kappa} + 1}.
\tag{3.18}
\]
We conclude with the following result concerning R-linear convergence of $x^k$ to $x^*$.

Theorem 3.2. For the choices (3.15) of $\alpha$ and (3.17) of $\beta$, the heavy-ball iteration (3.3) produces a sequence $\{x^k\}$ that converges R-linearly to $x^*$ with factor $\sqrt{\beta}$, where $\beta = 1 - 2/(\sqrt{\kappa} + 1)$.

Proof. We have from (3.8), (3.12), and (3.14) that
\[
\| x^k - x^* \| \le \| w^k \| \le \| V \| \, \| z^k \| = \beta^{k/2} \| V \| \, \| z^0 \| \le \beta^{k/2} \| V \| \, \| V^{-1} \| \, \| w^0 \|,
\]
proving the result. $\square$

We have monotonic decrease in the norm of the transformed vector $z^k$ (from (3.13)), but not in the original error vector $w^k$. In fact, experiments can show sharp increases in $\| w^k \|$ on early iterations, before the R-linear convergence promised by Theorem 3.2 takes effect.

Let us compare the linear convergence of the heavy-ball method against steepest descent on convex quadratics. Recall from (1.8) that the steepest-descent method with constant step $\alpha = 1/L$ requires $O((L/m) \log(1/\epsilon))$ iterations to obtain a reduction of factor $\epsilon$ in the function error $f(x^k) - f^*$. The rate defined by $\beta$ in Theorem 3.2 suggests a complexity of $O\bigl(\sqrt{L/m} \, \log(1/\epsilon)\bigr)$ to obtain a reduction of factor $\epsilon$ in $\| w^k \|$ (a different quantity). For problems in which the condition number $\kappa = L/m$ is moderate to large, the heavy-ball method has a significant advantage. For example, if $\kappa = 1000$, the improved rate translates into a factor-of-30 reduction in the number of iterations required, with virtually identical workload per iteration (one gradient evaluation and a few vector operations).
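This iteration-count comparison can be illustrated with a small experiment. In the sketch below, the diagonal quadratic, the starting point, and the tolerance are illustrative choices of ours; steepest descent uses the constant step $\alpha = 1/L$, and heavy ball uses the parameter choices described above.

```python
# Iteration counts for steepest descent (constant step 1/L) versus the
# heavy-ball method with alpha = 4/(sqrt(L)+sqrt(m))**2 and
# beta = (sqrt(L)-sqrt(m))/(sqrt(L)+sqrt(m)), on the diagonal quadratic
# f(x) = 0.5 * sum(lam_i * x_i**2) with minimizer x* = 0.  The spectrum,
# starting point, and tolerance are illustrative choices.

m, L = 1.0, 1000.0          # condition number kappa = 1000
lams = [m, 10.0, 100.0, L]  # eigenvalues of the diagonal Hessian

def grad(x):
    return [lam * xi for lam, xi in zip(lams, x)]

def err(x):
    return max(abs(xi) for xi in x)  # infinity-norm distance to x* = 0

def steepest_descent(x0, tol=1e-6, max_iter=200000):
    x = x0[:]
    for k in range(max_iter):
        if err(x) < tol:
            return k
        x = [xi - (1.0 / L) * gi for xi, gi in zip(x, grad(x))]
    return max_iter

def heavy_ball(x0, tol=1e-6, max_iter=200000):
    sqL, sqm = L ** 0.5, m ** 0.5
    alpha = 4.0 / (sqL + sqm) ** 2    # step length
    beta = (sqL - sqm) / (sqL + sqm)  # momentum parameter
    x_prev, x = x0[:], x0[:]
    for k in range(max_iter):
        if err(x) < tol:
            return k
        x, x_prev = [xi - alpha * gi + beta * (xi - xpi)
                     for xi, gi, xpi in zip(x, grad(x), x_prev)], x
    return max_iter

x0 = [1.0, 1.0, 1.0, 1.0]
k_sd = steepest_descent(x0)
k_hb = heavy_ball(x0)
print(k_sd, k_hb)  # heavy ball terminates in far fewer iterations
```

Since both methods perform one gradient evaluation and a few vector operations per iteration, the gap in iteration counts translates directly into a gap in run time.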