On the steepest descent algorithm for quadratic functions


Clóvis C. Gonzaga¹  Ruana M. Schneider²

July 9, 2015

Abstract

The steepest descent algorithm with exact line searches (Cauchy algorithm) is inefficient, generating oscillating step lengths and a sequence of points converging to the span of the eigenvectors associated with the extreme eigenvalues. The performance becomes very good if a short step is taken at every (say) 10 iterations. We show a new method for estimating short steps, and propose a method alternating Cauchy and short steps. Finally, we use the roots of a certain Chebyshev polynomial to further accelerate the method.

1 Introduction

We study the quadratic minimization problem

(P_z)  minimize_{z ∈ IR^n} f(z) = c^T z + (1/2) z^T H z,

where c ∈ IR^n and H ∈ IR^{n×n} is symmetric with eigenvalues 0 < d_1 < d_2 < ... < d_n, and condition number C = d_n/d_1. The problem has a unique solution z* ∈ IR^n. The steepest descent algorithm, also called the gradient method, is a memoryless method defined by

z_0 ∈ IR^n given,  z_{k+1} = z_k − λ_k ∇f(z_k),  λ_k > 0.  (1)

The only distinction among different steepest descent algorithms is in the choice of the step lengths λ_k. For the analysis, the problem may be simplified by assuming that z* = 0, and so f(z) = z^T H z / 2. The matrix H may be diagonalized by setting z = Mx, where M has orthonormal eigenvectors of H as columns. Then the function becomes

f(x) = (1/2) x^T D x,  D = diag(d_1, d_2, ..., d_n),  (2)

so that for z ∈ IR^n, f(z) = f(M^T z). M defines a similarity transformation, and hence for z = Mx, ||z|| = ||x||, f(z) = f(x), ∇f(z) = M ∇f(x), using throughout the paper the 2-norm ||z||² = z^T z.

¹ Department of Mathematics, Federal University of Santa Catarina, Florianópolis, SC, Brazil; ccgonzaga@gmail.com. The author was partially supported by CNPq under grant 30843/.
² Department of Mathematics, Federal University of Santa Catarina, Florianópolis, SC, Brazil; ruanamaira@gmail.com.

We define the diagonalized problem

(P)  minimize_{x ∈ IR^n} f(x) = (1/2) x^T D x.

The steepest descent iterations with step lengths λ_k for minimizing respectively f(·) from the initial point x_0 = M^T z_0, and f(·) from the initial point z_0, are related by z_k = M x_k. Thus, we may restrict our study to the diagonalized problem.

The steepest descent method was devised by Augustin Cauchy [4] in 1847. He studied the quadratic minimization problem, using in each iteration the Cauchy step

λ_k = argmin_{λ ≥ 0} f(x_k − λ∇f(x_k)).  (3)

The steepest descent method with Cauchy steps will be called the Cauchy algorithm. Steepest descent is the most basic algorithm for the unconstrained minimization of continuously differentiable functions, with step lengths computed by a multitude of line search schemes. The quadratic problem is the simplest non-trivial non-linear programming problem. Being able to solve it is a pre-requisite for any method for more general problems, and this is the first reason for the great effort dedicated to its solution. A second reason is that the optimal solution of (P_z) is the solution of the linear system Hz = −c.

It was soon noticed that the Cauchy algorithm generates inefficient zig-zagging sequences. This phenomenon was established by Akaike [1] in 1959, and further developed by Forsythe [9]. A clear explanation of its consequences is found in Nocedal, Sartenaer and Zhu [11]. For some time the steepest descent method was displaced by methods using second order information. In the last years gradient methods returned to the scene due to the need to tackle large scale problems, with millions of variables, and due to novel methods for computing the step lengths. Barzilai and Borwein [2] proposed a new step length computation with surprisingly good properties, which was further extended to non-quadratic problems by Raydan [12], and studied by Dai [5], Raydan and Svaiter [13], Birgin, Martínez and Raydan [3], among others.
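To make the definitions concrete, the Cauchy iteration on the diagonalized problem (P) can be sketched in a few lines of pure Python. This is only a minimal illustration, not the authors' code; the eigenvalues and starting point below are invented:

```python
# Cauchy algorithm on f(x) = (1/2) x^T D x with D = diag(d):
# grad f(x)_i = d_i x_i, and the exact line search step (3) is
# lambda = (g^T g)/(g^T D g).  Pure Python, illustrative data only.

def cauchy_method(d, x0, iters):
    """Run `iters` Cauchy steps; return the final point and the steps."""
    x = list(x0)
    steps = []
    for _ in range(iters):
        g = [di * xi for di, xi in zip(d, x)]            # gradient
        gg = sum(gi * gi for gi in g)                    # g^T g
        gdg = sum(di * gi * gi for di, gi in zip(d, g))  # g^T D g
        if gdg == 0.0:                                   # g = 0: optimal point
            break
        lam = gg / gdg                                   # Cauchy step (3)
        x = [xi - lam * gi for xi, gi in zip(x, g)]
        steps.append(lam)
    return x, steps

d = [0.01, 0.1, 0.5, 1.0]            # eigenvalues of D (C = 100)
x0 = [1.0 / di ** 0.5 for di in d]   # f(x0) = n/2 = 2
x, steps = cauchy_method(d, x0, 200)
f = 0.5 * sum(di * xi * xi for di, xi in zip(d, x))
# Every step length lies in (1/d_n, 1/d_1) and f decreases monotonically.
print(f < 2.0, all(1.0 < s < 100.0 for s in steps))
```

Even on this tiny instance the list `steps` soon oscillates between two values, which is the behavior studied in Section 2.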
In another line of research, several methods were developed to enhance the Cauchy algorithm by breaking its zig-zagging pattern. For the latest developments of this subject, see De Asmundis et al. [6, 7].

This paper has two goals. In Section 2 we write complete proofs for the main properties of the Cauchy algorithm, including the classical results by Akaike, hopefully simplifying the treatment. In Section 3 we develop new ways of breaking the oscillatory behavior of the algorithm and compare several enhancements. We do not attempt to generalize these procedures to non-quadratic or to constrained problems.

2 The Cauchy algorithm

In this section we intend to state and prove the main asymptotic properties of the Cauchy algorithm. These properties will be summarized in Theorem 1 below, and then the whole section is dedicated to proving it and describing some other properties of the sequences generated by the algorithm. The reader may well accept the theorem without going through the proofs, and proceed to the next section. The reason for proving these results, which are not original, is that the original Akaike and Forsythe papers, not being aimed exclusively at these results, are frequently not considered easy to read. We also prove simplified versions of results in [1, 9, 11], with a unified notation.

Given a point x ∈ IR^n, with g = ∇f(x), the Cauchy step from x is defined by

λ = argmin_{σ ≥ 0} f(x − σg) = (g^T g)/(g^T D g),

so that

x^+ = (I − λD)x,  g^+ = (I − λD)g.

It follows from the definition of λ that g^T g^+ = 0. We shall frequently use the value μ = 1/λ = (g^T D g)/||g||².

The following sequences will be associated with an application of the steepest descent algorithm from an initial point x_0 with g_0 = ∇f(x_0):

(x_k): iterates: x_{k+1} = (I − λ_k D) x_k.
(g_k): gradients: g_{k+1} = (I − λ_k D) g_k.
(λ_k): Cauchy step lengths from x_k.
(μ_k): μ_k = 1/λ_k.
(y_k): normalized gradients y_k = g_k/||g_k||.

Assumption: We assume that g_0 has no zero components. This is done because if a component is null, then it will remain null forever. It is well known that x_k → 0 and g_k → 0. The main goal of this section is the study of the asymptotic properties of the sequence of normalized gradients y_k, following Akaike [1] and Forsythe [9].

2.1 The main theorem

The theorem below summarizes some of the main results of the cited references. It is stated, commented, and then proved in the following section.

Theorem 1. Consider the sequences with elements x_k, g_k, y_k ∈ IR^n, λ_k and μ_k = 1/λ_k generated by the Cauchy algorithm from an initial point x_0 ∈ IR^n, assuming that n > 1 and x^0_1, x^0_n ≠ 0. Then there exist μ, μ' ∈ (d_1, d_n), r, r' ∈ IR^n and α ∈ (0, 1) such that

(i) μ_{2k} → μ, μ_{2k+1} → μ', with μ + μ' = d_1 + d_n.
(ii) y^k_i → 0 for i = 2, 3, ..., n−1.
(iii) y_{2k} → r, y_{2k+1} → r', with r, r' ∈ L(e_1, e_n), where L(e_1, e_n) denotes the subspace generated by e_1 and e_n.
(iv) lim_k g^{k+2}_i/g^k_i exists for all i = 1, ..., n such that g^k_i ≠ 0 for k ∈ IN.
(v) lim_k g^{k+2}_1/g^k_1 = lim_k g^{k+2}_n/g^k_n = α, with α ≥ 1 − 2 d_1 d_n/(d̄² − δ²), where d̄ = (d_1 + d_n)/2 and δ = min{|d̄ − d_i| : i = 1, ..., n}.
(vi) The limiting values μ, μ' are bounded by μ_min = d̄ − √((d̄² + δ²)/2) ≤ μ, μ' ≤ d̄ + √((d̄² + δ²)/2) = μ_max.
(vii) For large k, |g^k_n| oscillates around the values of |g^k_1|, with r_n/r_1 = −r'_1/r'_n = √((μ − d_1)/(d_n − μ)).

One of the scopes of this paper is to present a complete proof of this theorem. The proof will be postponed to the next section, after a qualitative analysis of the asymptotic properties of the algorithm.
For this discussion, assume that the condition number C is large, say, C >> 10, and that the space dimension is n > 2, possibly large. The variables will be loosely classified as light, medium and heavy, associated respectively with small, medium and large eigenvalues.

In a typical step, g^+_i = (1 − λd_i) g_i, λ = 1/μ. Let us comment on step sizes.

Short steps: safe and frequent. We call short a step with μ ≥ d_n/2. Short steps are:
- Harmless: all |g_i| decrease.
- Efficient for heavy variables.
- Inefficient for light variables: if λ << 1/d_i, then |g^+_i|/|g_i| = 1 − λd_i is near 1.

We conclude that short steps reduce heavy variables, with little effect on light variables. About one half of the steps will be short.

Large steps: dangerous but needed. We call large a step with μ < d_n/2. Large steps are:
- Dangerous: the heavy variables (|g_n|) increase.
- Reasonably efficient for light variables. Very light variables (d_i near d_1) can only be reduced by very large steps. For example, if d_1 = 0.001, d_n = 1, a step λ = 500 >> 2/d_n produces g^+_1 = g_1/2 and g^+_n = −499 g_n.

We conclude that only very large steps are efficient for reducing |g_1|, and they increase very much the heavy variables. Medium steps are also inefficient for the light variables.

Cauchy steps: safe and essential. Cauchy steps satisfy μ ∈ (d_1, d_n), and hence with g^+_i = (1 − d_i/μ) g_i, g_1 always decreases in absolute value, keeping its sign unchanged; g_n changes sign at all iterations, and increases in absolute value if the step is large. The Cauchy algorithm generates a sequence of steps λ_k = 1/μ_k that oscillate, with (μ_{2k}, μ_{2k+1}) converging to the two limit points μ, μ', with μ + μ' = d_1 + d_n. Then (say) μ < (d_1 + d_n)/2 and μ' > (d_1 + d_n)/2 > d_n/2. So, for large k, the steps alternate between short and large. Cauchy steps are in general medium or short, i.e., they are not very large.

The sequence of normalized gradients (y_k) zig-zags, so that the pairs of consecutive iterates (y_{2k}, y_{2k+1}) converge to a pair of vectors (r, r') in L(e_1, e_n). In the original space, this is the subspace generated by the eigenvectors associated with the two extreme eigenvalues d_1, d_n. The step lengths also zig-zag, so that (λ_{2k}, λ_{2k+1}) → (1/μ, 1/μ'), or equivalently (μ_{2k}, μ_{2k+1}) → (μ, μ'), with μ + μ' = d_1 + d_n.
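These effects can be checked directly from the factor 1 − λd_i applied to each gradient component. The numeric sketch below simply evaluates this factor for the d_1 = 0.001, d_n = 1 example above:

```python
# One gradient step multiplies component i by (1 - lam * d_i).
d1, dn = 0.001, 1.0

def factor(lam, di):
    """Multiplier applied to g_i by a gradient step of length lam."""
    return 1.0 - lam * di

# Short step (lam <= 2/dn): harmless, strongly reduces the heavy component.
print(factor(1.0, dn))    # 0.0   -> heavy component annihilated
print(factor(1.0, d1))    # 0.999 -> light component barely moves

# Very large step (lam = 500 >> 2/dn): halves g_1 but blows up g_n.
print(factor(500.0, d1))  # 0.5
print(factor(500.0, dn))  # -499.0
```

The sign change in the last value is the mechanism behind the zig-zag: large steps flip and amplify the heavy components.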
For large, the absolute values of all gradient components must decrease in each pair of iterations. In fact, g never changes sign and decreases in absolute value in all iterations; g n changes size in all iterations, while g n oscillates around g for large. Let us follow the iterates of the Cauchy algorithm for an example with 50 variables, with eigenvalues 0.0 d i, xi 0 = / d i. Here are some numerical values for this example: µ min = 0.48, µ = 0.467, µ = 0.543, µ max = α = 0.922, r n /r = 0.926, r n/r =.080. The value α in item (v) deserves some attention: if C is large and δ is small (there exists an eigenvalue near d), then α > 8d = 8/C, very near the worst case convergence rate nown for the Cauchy algorithm, given by ((C )/(C + )) 2 4/C, counting pairs of iterations. The figures show the values of g i, with the eigenvalues in the horizontal axis. Each figure shows three iterates g, g +, g +2, with vertical lines at the points µ, µ + for = 0, = 5 and = 2. One sees that the gradient components associated with eigenvalues near µ are substantially reduced in the iteration. The last figure reproduces the third one using a logarithmic scale in the horizontal axis. In the beginning iterations the heavy variables are reduced (in absolute value), and soon the oscillatory pattern is achieved. The values of µ, µ + converge to the values µ, µ as in the theorem,

Figure 1: Absolute values of gradient components in consecutive iterations (panels: iterations 0, 1, 2; iterations 5, 6, 7; iterations 12, 13, 14; and the same in logarithmic scale), showing the values of μ_k = 1/λ_k.

the medium gradient components are reduced and |g_n| oscillates around |g_1|, being slowly reduced at each pair of iterations. In the last figure we see that the algorithm affects the medium and heavy variables, with little effect on the light variables. Large steps (small values of μ) are needed to reduce the very light components, and they never happen. Fig. 2 plots the evolution of μ_k (left) and of |g_1| and |g_n|, showing the oscillatory behavior. The step lengths stabilize very fast. As |g^0_n| >> |g^0_1|, |g_n| decreases (in an oscillatory fashion) in the beginning iterations to match |g_1|.

Figure 2: Behavior of λ_k and of |g_1|, |g_n| (panels: step sizes; first and last components of g). λ_k oscillates, |g_1| decreases and |g_n| oscillates.
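The two limit points of the inverse steps are easy to observe numerically. The small pure-Python experiment below is our own illustration, with a uniformly spaced spectrum on [0.01, 1] (not necessarily the exact data of the figures); it checks that, for large k, consecutive inverse steps μ_k sum to approximately d_1 + d_n, as in Theorem 1(i):

```python
# Run the Cauchy algorithm and record mu_k = (g^T D g)/(g^T g):
# for large k, mu_k + mu_{k+1} approaches d_1 + d_n (Theorem 1(i)).
n = 50
d = [0.01 + 0.99 * i / (n - 1) for i in range(n)]   # d_1 = 0.01, d_n = 1
x = [1.0 / di ** 0.5 for di in d]

mus = []
for _ in range(400):
    g = [di * xi for di, xi in zip(d, x)]           # gradient of (1/2) x^T D x
    mu = sum(di * gi * gi for di, gi in zip(d, g)) / sum(gi * gi for gi in g)
    mus.append(mu)
    x = [xi - (1.0 / mu) * gi for xi, gi in zip(x, g)]

# Late consecutive pair: mu + mu' should be close to d_1 + d_n = 1.01.
print(abs(mus[-2] + mus[-1] - (d[0] + d[-1])) < 1e-3)
```

Printing a few late entries of `mus` also shows the alternation between a value below (d_1 + d_n)/2 and one above it.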

2.2 The Akaike-Forsythe results: proof of the main theorem

The results in this subsection are a simplified version of those in Forsythe [9], reduced to the case studied in this paper. We follow his proofs almost step by step, hopefully simplifying the treatment. The study is simplified by the introduction of the following sequence:

(w_k): defined by w_0 = g_0, w_{k+1} = (μ_k I − D) w_k.

It is easy to see that w_k = g_k ∏_{i=0}^{k−1} μ_i. The sequence (w_k) does not seem to have much interest in itself, but as we shall see it is very handy for the study of each iteration of the algorithm. Note that the normalized gradients satisfy y_k = g_k/||g_k|| = w_k/||w_k||.

Notation. We are mostly interested in the sequence of gradient vectors. When studying an iteration, we use the following maps, defined for a vector g ∈ IR^n, g ≠ 0:

μ(g) = (g^T D g)/(g^T g) ∈ [d_1, d_n],  g^+(g) = (I − D/μ(g)) g,  y^+(g) = g^+(g)/||g^+(g)||.  (4)

Whenever no confusion is possible, we omit the argument. If g^+ ≠ 0, we also define μ^+ = μ(g^+) and g^{++}. The fact that μ ∈ [d_1, d_n] is a standard result in linear algebra. If g has at least two non-zero components, then μ(g) ∈ (d_1, d_n). When dealing with the sequence (w_k), we use the following notation:

w^+(w) = (μ(w) I − D) w,  w^{++}(w) = (μ^+ I − D) w^+(w),  (5)

when w^+ ≠ 0. Note that μ(g_k) = μ(w_k) = μ(y_k). By construction of the Cauchy steps, g_k^T g_{k+1} = w_k^T w_{k+1} = y_k^T y_{k+1} = 0.

Let us begin by isolating a special case (and later prove that it never happens):

Lemma 1. Suppose that w = α e_p for some p ∈ {1, 2, ..., n}, α ≠ 0. Then

μ(w) = d_p,  w^+(w) = 0,  lim_{v→w} ||w^+(v)||/||v|| = 0.

Proof. Clearly, μ(w) = e_p^T D e_p = d_p, and then w^+(w) = α(d_p I − D) e_p = 0, because D e_p = d_p e_p. The last result follows from the continuity of μ(·) (and hence of w^+(·)) at w ≠ 0.

Lemma 2. Let p < q be the first and last non-zero components of a vector w ∈ IR^n. Then μ(w) ∈ (d_p, d_q) and w^+_p ≠ 0, w^+_q ≠ 0. For the Cauchy algorithm, if g^0_1 ≠ 0 and g^0_n ≠ 0, then the same is true for all w_k and g_k.

Proof. The result follows from (4)
and the lemma above: μ(w) = d_p (or μ(w) = d_q) can only occur if w_p (or w_q) is its only non-zero component. The result on the sequence (g_k) follows by induction.

From the lemmas above we see that the behavior of the Cauchy algorithm is anomalous at points g on the coordinate axes. Let us call such vectors scalar, and eliminate them by defining the sets

Ω = {g ∈ IR^n | g is not scalar},  Z = {y ∈ Ω | ||y|| = 1}.

Continuous maps: now all the maps defined above (μ, μ^+, g^+, g^{++}, ...) are continuous on Ω. We use the simplified notation μ = μ(w), .... We assume that the Cauchy algorithm is applied to the problem (P) from a point g_0 ∈ Ω with no null component.

Theorem 2. Let (w_k) be generated by the Cauchy algorithm. Then for k = 0, 1, ...,

||w_{k+1}||/||w_k|| = cos(ψ_k) ||w_{k+2}||/||w_{k+1}||,

where ψ_k is the angle between w_k and w_{k+2}.

Proof. Using our simplified notation for w = w_k, we first prove that ||w^+||² = w^T w^{++}. We have

w^T w^{++} = w^T (μI − D)(μ^+ I − D) w,
||w^+||² = w^T (μI − D)(μI − D) w.

Subtracting, we obtain

w^T w^{++} − ||w^+||² = w^T (μI − D)(μ^+ − μ) w = (μ^+ − μ) w^T w^+ = 0.

Now, using this and Cauchy-Schwarz,

w^T w^{++} = ||w|| ||w^{++}|| cos(ψ) = ||w^+||²,

which divided by ||w|| ||w^+|| gives the desired result, completing the proof.

Theorem 3. Let (w_k) be generated by the Cauchy algorithm from g_0 ∈ Ω, and let ψ_k be the angle between w_k and w_{k+2}. Then:

(i) The sequence (φ_k), with φ_k = ||w_{k+1}||/||w_k||, increases towards a limit L > 0.
(ii) lim ψ_k = 0 and lim ||y_k − y_{k+2}|| = 0.
(iii) All limit points of (y_k) are in Z.

Proof. (i) The sequence increases as a consequence of Theorem 2. Let us show that it is bounded: given k ∈ IN, w_{k+1} = (μ_k I − D) w_k, and then ||w_{k+1}|| ≤ ||μ_k I − D|| ||w_k||. We know that μ_k ∈ (d_1, d_n), and hence

||μ_k I − D|| = max{|μ_k − d_i| : i = 1, 2, ..., n} ≤ d_n − d_1,  φ_k = ||w_{k+1}||/||w_k|| ≤ d_n − d_1.

Thus the sequence is increasing and bounded, implying that it converges to some value L.
(ii) From Theorem 2, φ_k = cos(ψ_k) φ_{k+1}, and taking limits, cos(ψ_k) = φ_k/φ_{k+1} → 1, showing that ψ_k → 0. As ||y_k|| = ||y_{k+2}|| = 1 and ψ_k is the angle between these vectors, (ii) holds.
(iii) Assume that y_k →_K r, with K ⊂ IN. If r ∉ Z, then necessarily r = ±e_p for some p = 1, ..., n. Then, by Lemma 1,

lim_{y→r} ||w^+(y)|| = 0.

This contradicts (i), because φ_k = ||w_{k+1}||/||w_k|| = ||w^+(y_k)||/||y_k||, which converges to L > 0, completing the proof.

Summing up, the algorithm generates a sequence (y_k) with y_k = g_k/||g_k|| = w_k/||w_k||, whose limit points are all in the set Z. Then all transformations like μ(y), w^+(y), ... are continuous at any limit point r of (y_k).
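Theorem 3(i) can also be tested numerically. The sketch below (our own illustration, with invented data) iterates w_{k+1} = (μ_k I − D) w_k and checks that φ_k = ||w_{k+1}||/||w_k|| is nondecreasing and bounded by d_n − d_1:

```python
# The auxiliary sequence w_{k+1} = (mu_k I - D) w_k and the ratios
# phi_k = ||w_{k+1}|| / ||w_k||, which should increase towards L.
d = [0.05, 0.2, 0.4, 0.7, 1.0]
w = [1.0] * len(d)

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

phis = []
for _ in range(60):
    mu = sum(di * wi * wi for di, wi in zip(d, w)) / sum(wi * wi for wi in w)
    w_next = [(mu - di) * wi for di, wi in zip(d, w)]
    phis.append(norm(w_next) / norm(w))
    w = w_next

increasing = all(phis[k] <= phis[k + 1] + 1e-12 for k in range(len(phis) - 1))
print(increasing, phis[-1] <= (d[-1] - d[0]) + 1e-12)
```

The small tolerance only guards against floating point rounding; in exact arithmetic the monotonicity is strict until convergence.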

For instance, assuming that y_k →_K r, and defining

r̄ = w^+(r) = (μ(r) I − D) r,  r̄' = w^{++}(r) = (μ^+(r) I − D) r̄,  r^+ = r̄/||r̄||,  r^{++} = r̄'/||r̄'||,

we deduce that y_{k+1} →_K r^+ and y_{k+2} →_K r^{++} = r. This last equality follows from Theorem 3. Also from this theorem, we see that r̄' = L² r.

Remarks. There should be no confusion: given r ∈ Z with unit norm, r̄ = (μI − D)r has ||r̄|| = L, and r^+ has unit norm. Also r^{++} = r. The sequence (y_{2k}) satisfies ||y_{2k+2} − y_{2k}|| → 0. It is then easy to prove that either it is convergent or it has no isolated limit point (actually, the set of limit points is a continuum).

Theorem 4. Let (y_k) be the sequence of normalized gradients generated by the Cauchy algorithm from y_0 = g_0/||g_0||. Then (y_{2k}) converges to a non-scalar point r ∈ L(e_1, e_n), the linear space generated by the basis vectors e_1 and e_n.

Proof. We already know that any limit point of (y_k) is non-scalar. Let r be a limit point of (y_{2k}). The proof will be done in three steps.

(i) There exist two indices p < q such that r, r^+ ∈ L(e_p, e_q).

Let r be a limit point of (y_{2k}), and define the vectors r̄, r̄' as above. We have

r̄'_i = (μ − d_i)(μ^+ − d_i) r_i = L² r_i,  i = 1, ..., n,

where μ, μ^+ are fixed. For each i, either r_i = 0 or (μ − d_i)(μ^+ − d_i) = L². This is a second degree equation in d_i, which will have two real solutions, say, d_i = d_p and d_i = d_q, because Theorem 3(iii) prevents a single solution. If r ∈ L(e_p, e_q), then trivially r̄ = (μI − D)r ∈ L(e_p, e_q).

(ii) r is the unique limit point of (y_{2k}).

Assume by contradiction that r is not a unique limit point, and hence a non-isolated limit point of (y_{2k}). Then there exist an infinite number of limit points v satisfying ||v − r|| < min{|r_p|, |r_q|}. All such points must belong to L(e_p, e_q), for the following reason: any limit point v must belong to some space L(e_s, e_t), where s, t are indices as in (i). If {s, t} ≠ {p, q}, then either v_p = 0 or v_q = 0, and then ||v − r|| ≥ min{|r_p|, |r_q|}. Let us examine a point r ∈ L(e_p, e_q).
We have

μ = r_p² d_p + r_q² d_q,

and immediately,

μ − d_p = r_q² (d_q − d_p),  μ − d_q = r_p² (d_p − d_q).

We also know that ||r̄|| = L, where

r̄_p = (μ − d_p) r_p,  r̄_q = (μ − d_q) r_q.

So,

L² = r̄_p² + r̄_q² = (d_q − d_p)² (r_q⁴ r_p² + r_p⁴ r_q²) = (d_q − d_p)² r_p² r_q² (r_q² + r_p²) = (d_q − d_p)² r_p² r_q².

Hence, we have the equations

r_p² r_q² = L²/(d_q − d_p)²,  r_p² + r_q² = 1,

a system with a finite number of isolated solutions, establishing a contradiction and proving (ii).

(iii) r ∈ L(e_1, e_n).

Let us prove that q = n. The proof that p = 1 is similar. Assume by contradiction that q < n. We know that both r and r^+ belong to L(e_p, e_q), and by Lemma 2, μ(r) < d_q < d_n and μ(r^+) < d_q. Since r, r^+ are the unique limit points of the sequence (y_k), for k sufficiently large, say, k ≥ k̄, μ(y_k) < d_q. For such k,

|g^{k+1}_q|/|g^k_q| = |1 − d_q/μ(y_k)| < |1 − d_n/μ(y_k)| = |g^{k+1}_n|/|g^k_n|.

Hence for all k > k̄,

|g^k_q|/|g^k_n| < |g^k̄_q|/|g^k̄_n|,

contradicting the fact that y^k_n → 0 and y^k_q → r_q ≠ 0.

2.3 Properties of the limiting points

Now we know that the study of asymptotic properties of the sequences generated by the Cauchy algorithm may be reduced to the study of the limit points. This leads to simple results that reproduce properties described by Nocedal, Sartenaer and Zhu [11] and by De Asmundis et al. [6, 7]. Consider an application of the Cauchy algorithm as above, starting from x_0 with no null component, and define r = lim y_{2k}, with y_k = g_k/||g_k||. We know that the only non-zero components of r are r_1 and r_n. From the analysis above, writing r̄ = (μI − D)r, we know that

(μ'I − D)(μI − D) r = L² r,  (6)
μ = r_1² d_1 + r_n² d_n,  λ = 1/μ,  (7)
||r̄|| = L,  (8)
μ' = (r̄^T D r̄)/||r̄||²,  λ' = 1/μ',  (9)
(μ'I − D) r̄ = L² r,  ||(μ'I − D) r̄|| = L².  (10)

Lemma 3. μ + μ' = d_1 + d_n.

Proof. As (μI − D)(μ'I − D) r = L² r and r_1, r_n ≠ 0,

L² = (μ − d_1)(μ' − d_1) = (μ − d_n)(μ' − d_n).

Then

μμ' − d_1(μ + μ') + d_1² = μμ' − d_n(μ + μ') + d_n²,  (d_n − d_1)(μ + μ') = d_n² − d_1².

The result follows by dividing by (d_n − d_1), completing the proof.

This completes the proofs of items (i)-(iii) of Theorem 1. Let us prove the remaining items, studying a double iteration for large k:

lim_k g^{k+2}_i/g^k_i = (1 − d_i/μ)(1 − d_i/μ') ≤ (1 − d_n/μ)(1 − d_n/μ'),  (11)

because d_i ≤ d_n. As g^k_i → 0, we must have lim_k |g^{k+2}_i/g^k_i| ≤ 1,

proving (iv). In view of the inequality above, this is reduced to

(1 − d_i/μ)(1 − d_i/μ') > −1,  i = 1, ..., n.  (12)

Developing (12), we obtain immediately

2μμ' > (μ + μ') d_i − d_i² = (d_1 + d_n) d_i − d_i²,  μ + μ' = d_1 + d_n.  (13)

Let us define d̄ = (d_1 + d_n)/2, δ_i = d̄ − d_i for i = 1, ..., n, and δ = min{|δ_i| : i = 1, ..., n}. It is straightforward to check that the right hand side of (13) satisfies

(d_1 + d_n) d_i − d_i² = d̄² − δ_i²,

and hence we know that μ, μ' must satisfy

μμ' ≥ (d̄² − δ_i²)/2,  μ + μ' = 2d̄,  i = 1, ..., n,

which is equivalent to

μμ' ≥ (d̄² − δ²)/2,  μ + μ' = 2d̄.  (14)

The components g_1, g_n are reduced in each double iteration by α = (1 − d_n/μ)(1 − d_n/μ'). Developing this expression and substituting μ + μ' = d_1 + d_n we obtain

α = 1 − d_1 d_n/(μμ').  (15)

From (14) and this expression we conclude that

α ≥ 1 − 2 d_1 d_n/(d̄² − δ²),  (16)

proving item (v) of Theorem 1. Bounds for the inverse steps μ, μ' are obtained by solving the system (14), which reduces to a simple second degree equation whose solutions are

μ, μ' = d̄ ± √((d̄² + δ²)/2),  (17)

and thus μ_min ≤ μ, μ' ≤ μ_max, proving item (vi). The last item of the main theorem is proved in the following lemma:

Lemma 4. There exists c > 0 such that

c = r_n/r_1 = −r'_1/r'_n,  with c² = (μ − d_1)/(d_n − μ),

where r' = lim y_{2k+1}.

Proof. By construction of the Cauchy step, r ⊥ r̄, i.e., r_1 r̄_1 + r_n r̄_n = 0. The first equalities follow by dividing this by r_1 r̄_n. Using this result, consider r^+ = (I − D/μ) r. We have r^+_n = (1 − d_n/μ) r_n, r^+_1 = (1 − d_1/μ) r_1, and so

(r^+_n/r^+_1)/(r_n/r_1) = (1 − d_n/μ)/(1 − d_1/μ) = −(d_n − μ)/(μ − d_1).

From this last equality, it follows that

c² = r_n²/r_1² = (μ − d_1)/(d_n − μ) = (μ − d_1)/(μ' − d_1),

using the fact that μ + μ' = d_1 + d_n, completing the proof.

Some interesting relations proved in [7] are now straightforward: from r_n = c r_1 and r_1² + r_n² = 1 we obtain

r_1² = 1/(1 + c²),  r_n² = c²/(1 + c²),  (18)
μ = (d_1 + c² d_n)/(1 + c²) = d_1 (1 + c² C)/(1 + c²),  (19)
μ' = (c² d_1 + d_n)/(1 + c²) = d_1 (c² + C)/(1 + c²).  (20)

Let us now relate the constants L and c. We have L² = (μ − d_1)(μ' − d_1), with

μ − d_1 = d_1 (1 + c² C)/(1 + c²) − d_1 = c² d_1 (C − 1)/(1 + c²),
μ' − d_1 = d_1 (c² + C)/(1 + c²) − d_1 = d_1 (C − 1)/(1 + c²).

Simplifying these expressions and using Lemma 3 we obtain

L² = c² (d_n − d_1)²/(1 + c²)² = μμ' − d_1 d_n.  (21)

3 Breaking the cycle

The Cauchy algorithm reduces the medium variables and acts very slowly on the light and heavy variables. The cycle may be broken by enforcing either a very short (μ near d_n) or a very large (μ near d_1) step. It is difficult to estimate the value d_1, and a very large step will cause a great increase in the heavy variables (in absolute value), and will possibly increase the function value. A very large step should only be allowed if it is obtained by a Cauchy iterate, which always reduces the function value. A very short step is harmless, and reduces the heavy variables. If the medium variables are already small, the function will then be dominated by the light variables, and the next Cauchy step will be large, breaking the cycle.

This is done by De Asmundis et al. [7]: if μ_k, μ_{k+1} are near μ, μ', then μ_k + μ_{k+1} ≈ d_1 + d_n, and then the step 1/(μ_k + μ_{k+1}) will be short. They check the evolution of μ_k and decide to periodically estimate and apply short steps by this method. Their algorithm was named steepest descent with alignment (SDA). We shall follow their approach, but with a different way of estimating short steps: all steps will be Cauchy steps taken at some point, making sure that all iterations satisfy μ_k ∈ (d_1, d_n). In the algorithms below we call λ-gradient step a steepest descent step with step length λ.
The algorithm below is the Cauchy algorithm modified by periodically applying short steps.

Stopping rule. Each step must be followed by the statement k = k + 1 and by testing a stopping rule. The usual test is ||g_k|| ≤ ε, for a given ε > 0. In our examples with diagonal Hessian matrices we test the function values, because the optimal value is zero.

Algorithm 1. Cauchy-short algorithm.
Data: x_0 ∈ IR^n, K_I, K_C, K_s ∈ IN (respectively 10, 6, 2 in our choice), k = 0.
Take K_I Cauchy steps.
repeat
    Compute an estimated short step λ_s.
    Take K_s λ_s-gradient steps.
    Take K_C Cauchy steps.

Remark. The choice of K_I, K_C, K_s is somewhat arbitrary, but we noticed in many tests that the oscillatory pattern is achieved in about 10 Cauchy iterations initially, and in about 6 Cauchy iterations after each set of short steps. Two consecutive short steps are sufficient to cause a substantial reduction in the large variables. See Fig. 3 for an example. Note that even if the estimated short step is not very near 1/d_n, the short steps are harmless.

Estimating a short step: a mock large step. We intend to state an algorithm in which all iterations take Cauchy steps, and so all steps will satisfy d_1 < μ < d_n. As the cycle pattern is established, an artificially very large step would cause a great increase in the heavy variables, and consequently the next Cauchy step would be short. The very large step will be computed but not applied. Let S be a very large step length (S >> 1/d_1). At an iteration k we compute the following short step:

ḡ = (I − S D) g_k,  (22)
λ_s = (ḡ^T ḡ)/(ḡ^T D ḡ).  (23)

The reason why this will be a short step is the following: for a large value of S,

ḡ_i = (1 − S d_i) g^k_i ≈ −S d_i g^k_i,  i = 1, ..., n.

Hence the following Cauchy step will satisfy

λ_s = (ḡ^T ḡ)/(ḡ^T D ḡ) ≈ (Σ_{i=1}^{n−1} (g^k_i)² d_i² + (g^k_n)² d_n²) / (Σ_{i=1}^{n−1} (g^k_i)² d_i³ + (g^k_n)² d_n³).

If the medium components are small, then for i = 1, ..., n−1, |g^k_i| d_i << |g^k_n| d_n, and then λ_s ≈ 1/d_n. A safeguard may be used to enforce short steps, by setting λ_s = min{λ_s, λ_{k−1}, λ_{k−2}}, as min{λ_{k−1}, λ_{k−2}} should be a short step. Fig. 3 shows iterates before and after the short steps for our example. Fig. 4 shows the behavior of the step lengths for the Cauchy-short algorithm: the cyclic pattern is broken by periodically forcing a short step, which is naturally followed by a large Cauchy step.

Alternating Cauchy and short steps.
It is known that about one half of the steps will be short. The algorithm below uses the computation of short steps as above, but applies them after all Cauchy steps:

Algorithm 2. Alternated Cauchy-short steps.
Data: x_0 ∈ IR^n, K_I, K_C, K_s (respectively 10, 6, 2 in our choice), k = 0.
Take K_I Cauchy steps.
repeat
    Compute an estimated short step λ_s.
    Take K_s λ_s-gradient steps.
    Take K_C iterations composed of a Cauchy step followed by one λ_s-gradient step.
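The short-step estimate (22)-(23) used by both algorithms is cheap to implement. The sketch below is our own illustration of the mock large step (the function name and the data are ours; S is any value much larger than 1/d_1):

```python
# Mock large step: g_bar = (I - S D) g is computed but NOT applied;
# the Cauchy step evaluated at g_bar, as in (22)-(23), is close to
# 1/d_n whenever the medium components of g are already small.

def estimate_short_step(d, g, S=1e6):
    g_bar = [(1.0 - S * di) * gi for di, gi in zip(d, g)]   # (22)
    num = sum(gi * gi for gi in g_bar)                      # g_bar^T g_bar
    den = sum(di * gi * gi for di, gi in zip(d, g_bar))     # g_bar^T D g_bar
    return num / den                                        # (23)

d = [0.001, 0.01, 0.5, 1.0]
g = [1.0, 1.0, 0.01, 1.0]     # medium component already small
lam_s = estimate_short_step(d, g)
print(abs(lam_s - 1.0 / d[-1]) < 0.01)   # lam_s is near 1/d_n = 1
```

In a full implementation the safeguard λ_s = min{λ_s, λ_{k−1}, λ_{k−2}} mentioned above would be applied to the returned value.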

Figure 3: Before and after 2 short steps in the CS algorithm (panels: short step, followed by a large step; following iteration).

Figure 4: Behavior of λ_k and of |g_1|, |g_n| in an application of the CS algorithm (panels: step sizes; first and last components of g).

The algorithm has a stronger effect on the very heavy variables. Consequently the Cauchy and eventually the short steps increase. Fig. 5 shows the behavior of the function values on an example with C = 1000 and n = 1000, eigenvalues uniformly distributed between 0.001 and 1, x^0_i = 1/√d_i. The figure shows the evolution of the Barzilai-Borwein (BB), the Cauchy-short (CS), the steepest descent with alignment (SDA), and the alternated Cauchy-short (ACS) algorithms. It also shows the effect of using Chebyshev roots to correct the step lengths, as described below.

Using Chebyshev roots. The paper [10] has the following result: if bounds l ≤ d_1 and u ≥ d_n are known, then the stopping condition |x^K_i| ≤ ε |x^0_i| for i = 1, ..., n is achieved by the following steepest descent scheme: for k = 0 to K − 1, g_{k+1} = (I − λ_k D) g_k, where, using C = u/l,

K = K(C) = ⌈ cosh⁻¹(1/ε) / cosh⁻¹((C + 1)/(C − 1)) ⌉ ≈ (√C/2) log(2/ε),

and the set of step lengths is Λ = {1/μ_k | k = 0, ..., K − 1}, with

μ_k = ((u − l)/2) cos((2k + 1)π/(2K)) + (u + l)/2,  (24)

Figure 5: Function values for the BB, CS, ACS, SDA and CS with steps corrected to Chebyshev roots using adaptive bounds for the eigenvalues.

taken in any order. If the bounds l and u are exact, this number of iterations coincides almost exactly with the worst case performance of the Krylov space method, the best possible. This means that if good bounds for the eigenvalues are known, then the number of iterations of any gradient algorithm may be limited to K(C) if the step lengths are corrected to match elements of Λ. This is the resulting scheme:

Let C = u/l, compute K(C) and the set Λ by (24). Take any steepest descent method, and execute the following commands after the computation of each step length λ_k:
Set λ_k = argmin{|ν − λ_k| : ν ∈ Λ}.
Remove the element λ_k from the set Λ.

Adaptive scheme. This scheme depends on reliable lower and upper bounds l and u for the eigenvalues. When they are not available, we construct them by adding the following procedure to Algorithms 1 and 2:

Initialization: choose an integer k_0 > K_I + K_s, take the first k_0 iterations of the algorithm and set

u = 1.2 max_{k=0,...,k_0} 1/λ_k,  l = 0.25 min_{k=0,...,k_0} 1/λ_k,  C = u/l,

and compute Λ by (24).

Adaptation: at all steps with k > k_0, perform the following adaptation of the set Λ:
If u < 1/λ_k, set u = 1.2 u, C = u/l and reset Λ by (24).
If l > 1/λ_k, set l = l/4, C = u/l and reset Λ by (24).

The initial value of u will hopefully satisfy u > d_n, because the short step computation is efficient, both using our algorithms and SDA (with k_0 properly chosen). The lower bound will probably be updated, and each time l changes the size of Λ doubles. We applied this scheme to our test problems, using the Cauchy-short algorithm. The result is shown in Fig. 5.
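Under the stated assumptions (bounds l ≤ d_1 and u ≥ d_n), the construction of the set Λ in (24) and the snap-to-nearest correction can be sketched as follows. This is our own illustration; `chebyshev_steps` and `snap` are invented names, and the bound data is hypothetical:

```python
import math

def chebyshev_steps(l, u, eps=1e-6):
    """Inverse Chebyshev roots on [l, u]: the step set Lambda of (24)."""
    C = u / l
    K = math.ceil(math.acosh(1.0 / eps) / math.acosh((C + 1.0) / (C - 1.0)))
    mus = [0.5 * (u - l) * math.cos((2 * k + 1) * math.pi / (2 * K))
           + 0.5 * (u + l) for k in range(K)]
    return [1.0 / mu for mu in mus]

def snap(lam, pool):
    """Replace lam by the nearest still-unused element of Lambda."""
    best = min(pool, key=lambda nu: abs(nu - lam))
    pool.remove(best)
    return best

# l = 0.001, u = 1 gives C = 1000; K is roughly (sqrt(C)/2) * log(2/eps).
steps = chebyshev_steps(0.001, 1.0)
print(len(steps), all(1.0 < s < 1000.0 for s in steps))
```

In the adaptive scheme, `chebyshev_steps` would simply be called again with the updated l, u whenever a computed step falls outside [1/u, 1/l].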
This scheme is only applicable to quadratic functions, but it reduces all variables to |g^K_i| ≤ ε |g^0_i|, which may be important in the solution of linear systems of equations and least squares problems.

Performance profile. Finally, Fig. 7 is a performance profile, computed as described in [8], using 20 quadratic problems with 1000 variables, three values of the condition number C = 1000, 10000, 100000, and each problem with a different randomly generated initial point. The eigenvalues are randomly generated with four different distributions, exemplified by Fig. 6. It shows that all new schemes are efficient, and seem to be superior to the Barzilai-Borwein method, with the advantage of generating monotonically decreasing function values.

Figure 6: Eigenvalue distributions (panels: uniform, logarithmic, sinusoidal and blocks distributions): a point (i, s) means that d_i = s d_n.

Figure 7: Performance profile for the algorithms Barzilai-Borwein, Algorithm 1 (CS), Algorithm 2 (ACS), steepest descent with alignment (SDA) and CS with adaptive computation of Chebyshev steps.

Conclusion. We believe that a good understanding of algorithms for quadratic problems is fundamental for the design of methods for more general problems. The steepest descent method, mainly using approximated Cauchy steps, is the most well known of all methods. Its poor performance, which contradicts the simple intuition of greedily computing the maximum function reduction at each iteration, has always been frustrating. It was explained by Akaike, and only recently a simple cure has been found: just take a short step once in a while. In this paper we hope to have summarized the classical proofs of this behavior, and proposed new procedures for quadratic problems. We did not, for the time being, try to apply similar schemes to non-quadratic functions, but we hope that these ideas may lead to general methods without the need of strategies for dealing with non-monotonicity.

References

[1] H. Akaike. On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Statist. Math. Tokyo, 11:1-17, 1959.
[2] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8:141-148, 1988.
[3] E. G. Birgin, J. M. Martínez, and M. Raydan. Spectral projected gradient methods. In C. A. Floudas and P. M. Pardalos, editors, Encyclopedia of Optimization. Springer, 2009.
[4] A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Acad. Sci. Paris, 25:536-538, 1847.
[5] Y. H. Dai. Alternate step gradient method. Optimization, 52(4-5):395-415, 2003.
[6] R. De Asmundis, D. di Serafino, W. Hager, G. Toraldo, and H. Zhang. An efficient gradient method using the Yuan steplength. Technical report, Sapienza University of Rome, Italy, 2014.
[7] R. De Asmundis, D. di Serafino, R. Riccio, and G. Toraldo. On spectral properties of steepest descent methods. IMA J. Numer. Anal., 33:1416-1435, 2013.
[8] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201-213, 2002.
[9] G. E. Forsythe. On the asymptotic directions of the s-dimensional optimum gradient method. Numerische Mathematik, 11:57-76, 1968.
[10] C. C. Gonzaga. Optimal performance of the steepest descent algorithm for quadratic functions. Technical report, Federal University of Santa Catarina, Florianópolis, Brazil, 2014.
[11] J. Nocedal, A. Sartenaer, and C. Zhu. On the behavior of the gradient norm in the steepest descent method. Computational Optimization and Applications, 22:5-35, 2002.
[12] M. Raydan. The Barzilai and Borwein gradient method for large scale unconstrained minimization problem. SIAM Journal on Optimization, 7:26-33, 1997.
[13] M. Raydan and B. F. Svaiter. Relaxed steepest descent and Cauchy-Barzilai-Borwein method. Computational Optimization and Applications, 21:155-167, 2002.


More information

MATH 1120 (LINEAR ALGEBRA 1), FINAL EXAM FALL 2011 SOLUTIONS TO PRACTICE VERSION

MATH 1120 (LINEAR ALGEBRA 1), FINAL EXAM FALL 2011 SOLUTIONS TO PRACTICE VERSION MATH (LINEAR ALGEBRA ) FINAL EXAM FALL SOLUTIONS TO PRACTICE VERSION Problem (a) For each matrix below (i) find a basis for its column space (ii) find a basis for its row space (iii) determine whether

More information

The speed of Shor s R-algorithm

The speed of Shor s R-algorithm IMA Journal of Numerical Analysis 2008) 28, 711 720 doi:10.1093/imanum/drn008 Advance Access publication on September 12, 2008 The speed of Shor s R-algorithm J. V. BURKE Department of Mathematics, University

More information

An Efficient Modification of Nonlinear Conjugate Gradient Method

An Efficient Modification of Nonlinear Conjugate Gradient Method Malaysian Journal of Mathematical Sciences 10(S) March : 167-178 (2016) Special Issue: he 10th IM-G International Conference on Mathematics, Statistics and its Applications 2014 (ICMSA 2014) MALAYSIAN

More information

Chapter 10 Conjugate Direction Methods

Chapter 10 Conjugate Direction Methods Chapter 10 Conjugate Direction Methods An Introduction to Optimization Spring, 2012 1 Wei-Ta Chu 2012/4/13 Introduction Conjugate direction methods can be viewed as being intermediate between the method

More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information