On the steepest descent algorithm for quadratic functions


Clóvis C. Gonzaga¹  Ruana M. Schneider²

July 9, 2015

Abstract

The steepest descent algorithm with exact line searches (Cauchy algorithm) is inefficient, generating oscillating step lengths and a sequence of points converging to the span of the eigenvectors associated with the extreme eigenvalues. The performance becomes very good if a short step is taken at every (say) 10 iterations. We show a new method for estimating short steps, and propose a method alternating Cauchy and short steps. Finally, we use the roots of a certain Chebyshev polynomial to further accelerate the method.

1 Introduction

We study the quadratic minimization problem

(P_z)  minimize_{z ∈ IR^n} f(z) = c^T z + (1/2) z^T H z,

where c ∈ IR^n and H ∈ IR^{n×n} is symmetric with eigenvalues 0 < d_1 < d_2 < ... < d_n, and condition number C = d_n/d_1. The problem has a unique solution z* ∈ IR^n. The steepest descent algorithm, also called the gradient method, is a memoryless method defined by

z_0 ∈ IR^n given,  z_{k+1} = z_k − λ_k ∇f(z_k),  λ_k > 0.  (1)

The only distinction among different steepest descent algorithms is in the choice of the step lengths λ_k. For the analysis, the problem may be simplified by assuming that z* = 0, and so f(z) = z^T H z / 2. The matrix H may be diagonalized by setting z = Mx, where M has orthonormal eigenvectors of H as columns. Then the function becomes

f(x) = (1/2) x^T D x,  D = diag(d_1, d_2, ..., d_n),  (2)

so that for z ∈ IR^n, f(z) = f(M^T z). M defines a similarity transformation, and hence for z = Mx, ||z|| = ||x||, f(z) = f(x), ∇f(z) = M ∇f(x), using throughout the paper the 2-norm ||z||² = z^T z.

¹ Department of Mathematics, Federal University of Santa Catarina, Florianópolis, SC, Brazil; ccgonzaga@gmail.com. The author was partially supported by CNPq under grant 30843/.
² Department of Mathematics, Federal University of Santa Catarina, Florianópolis, SC, Brazil; ruanamaira@gmail.com.

We define the diagonalized problem

(P)  minimize_{x ∈ IR^n} f(x) = (1/2) x^T D x.

The steepest descent iterations with step lengths λ_k for minimizing respectively f(·) from the initial point x_0 = M^T z_0, and f(·) from the initial point z_0, are related by z_k = M x_k. Thus, we may restrict our study to the diagonalized problem.

The steepest descent method was devised by Augustin Cauchy [4] in 1847. He studied the quadratic minimization problem, using in each iteration the Cauchy step

λ_k = argmin_{λ ≥ 0} f(x_k − λ∇f(x_k)).  (3)

The steepest descent method with Cauchy steps will be called the Cauchy algorithm. Steepest descent is the most basic algorithm for the unconstrained minimization of continuously differentiable functions, with step lengths computed by a multitude of line search schemes. The quadratic problem is the simplest non-trivial non-linear programming problem. Being able to solve it is a pre-requisite for any method for more general problems, and this is the first reason for the great effort dedicated to its solution. A second reason is that the optimal solution of (P_z) is the solution of the linear system Hz = −c.

It was soon noticed that the Cauchy algorithm generates inefficient zig-zagging sequences. This phenomenon was established by Akaike [1] in 1959, and further developed by Forsythe [9]. A clear explanation of its consequences is found in Nocedal, Sartenaer and Zhu [11]. For some time the steepest descent method was displaced by methods using second order information. In the last years gradient methods returned to the scene due to the need to tackle large scale problems, with millions of variables, and due to novel methods for computing the step lengths. Barzilai and Borwein [2] proposed a new step length computation with surprisingly good properties, which was further extended to non-quadratic problems by Raydan [12], and studied by Dai [5], Raydan and Svaiter [13], Birgin, Martínez and Raydan [3], among others.
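To make the definitions concrete, the Cauchy iteration on the diagonalized problem (P) can be sketched in a few lines of pure Python. This is only a minimal illustration, not the authors' code; the eigenvalues and starting point below are invented:

```python
# Cauchy algorithm on f(x) = (1/2) x^T D x with D = diag(d):
# grad f(x)_i = d_i x_i, and the exact line search step (3) is
# lambda = (g^T g)/(g^T D g).  Pure Python, illustrative data only.

def cauchy_method(d, x0, iters):
    """Run `iters` Cauchy steps; return the final point and the steps."""
    x = list(x0)
    steps = []
    for _ in range(iters):
        g = [di * xi for di, xi in zip(d, x)]            # gradient
        gg = sum(gi * gi for gi in g)                    # g^T g
        gdg = sum(di * gi * gi for di, gi in zip(d, g))  # g^T D g
        if gdg == 0.0:                                   # g = 0: optimal point
            break
        lam = gg / gdg                                   # Cauchy step (3)
        x = [xi - lam * gi for xi, gi in zip(x, g)]
        steps.append(lam)
    return x, steps

d = [0.01, 0.1, 0.5, 1.0]            # eigenvalues of D (C = 100)
x0 = [1.0 / di ** 0.5 for di in d]   # f(x0) = n/2 = 2
x, steps = cauchy_method(d, x0, 200)
f = 0.5 * sum(di * xi * xi for di, xi in zip(d, x))
# Every step length lies in (1/d_n, 1/d_1) and f decreases monotonically.
print(f < 2.0, all(1.0 < s < 100.0 for s in steps))
```

Even on this tiny instance the list `steps` soon oscillates between two values, which is the behavior studied in Section 2.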
In another line of research, several methods were developed to enhance the Cauchy algorithm by breaking its zig-zagging pattern. For the latest developments of this subject, see De Asmundis et al. [6, 7].

This paper has two goals. In Section 2 we write complete proofs for the main properties of the Cauchy algorithm, including the classical results by Akaike, hopefully simplifying the treatment. In Section 3 we develop new ways of breaking the oscillatory behavior of the algorithm and compare several enhancements. We do not attempt to generalize these procedures to non-quadratic or to constrained problems.

2 The Cauchy algorithm

In this section we intend to state and prove the main asymptotic properties of the Cauchy algorithm. These properties will be summarized in Theorem 1 below, and then the whole section is dedicated to proving it and describing some other properties of the sequences generated by the algorithm. The reader may well accept the theorem without going through the proofs, and proceed to the next section. The reason for proving these results, which are not original, is that the original Akaike and Forsythe papers, not being aimed exclusively at these results, are frequently not considered easy to read. We also prove simplified versions of results in [1, 9, 11], with a unified notation.

Given a point x ∈ IR^n, with g = ∇f(x), the Cauchy step from x is defined by

λ = argmin_{σ ≥ 0} f(x − σg) = (g^T g)/(g^T D g),

so that

x^+ = (I − λD)x,  g^+ = (I − λD)g.

It follows from the definition of λ that g^T g^+ = 0. We shall frequently use the value μ = 1/λ = (g^T D g)/||g||².

The following sequences will be associated with an application of the steepest descent algorithm from an initial point x_0 with g_0 = ∇f(x_0):

(x_k): iterates: x_{k+1} = (I − λ_k D) x_k.
(g_k): gradients: g_{k+1} = (I − λ_k D) g_k.
(λ_k): Cauchy step lengths from x_k.
(μ_k): μ_k = 1/λ_k.
(y_k): normalized gradients y_k = g_k/||g_k||.

Assumption: We assume that g_0 has no zero components. This is done because if a component is null, then it will remain null forever. It is well known that x_k → 0 and g_k → 0. The main goal of this section is the study of the asymptotic properties of the sequence of normalized gradients y_k, following Akaike [1] and Forsythe [9].

2.1 The main theorem

The theorem below summarizes some of the main results of the cited references. It is stated, commented, and then proved in the following section.

Theorem 1. Consider the sequences with elements x_k, g_k, y_k ∈ IR^n, λ_k and μ_k = 1/λ_k generated by the Cauchy algorithm from an initial point x_0 ∈ IR^n, assuming that n > 1 and x^0_1, x^0_n ≠ 0. Then there exist μ, μ' ∈ (d_1, d_n), r, r' ∈ IR^n and α ∈ (0, 1) such that

(i) μ_{2k} → μ, μ_{2k+1} → μ', with μ + μ' = d_1 + d_n.
(ii) y^k_i → 0 for i = 2, 3, ..., n−1.
(iii) y_{2k} → r, y_{2k+1} → r', with r, r' ∈ L(e_1, e_n), where L(e_1, e_n) denotes the subspace generated by e_1 and e_n.
(iv) lim_k g^{k+2}_i/g^k_i exists for all i = 1, ..., n such that g^k_i ≠ 0 for k ∈ IN.
(v) lim_k g^{k+2}_1/g^k_1 = lim_k g^{k+2}_n/g^k_n = α, with α ≥ 1 − 2 d_1 d_n/(d̄² − δ²), where d̄ = (d_1 + d_n)/2 and δ = min{|d̄ − d_i| : i = 1, ..., n}.
(vi) The limiting values μ, μ' are bounded by μ_min = d̄ − √((d̄² + δ²)/2) ≤ μ, μ' ≤ d̄ + √((d̄² + δ²)/2) = μ_max.
(vii) For large k, |g^k_n| oscillates around the values of |g^k_1|, with r_n/r_1 = −r'_1/r'_n = √((μ − d_1)/(d_n − μ)).

One of the scopes of this paper is to present a complete proof of this theorem. The proof will be postponed to the next section, after a qualitative analysis of the asymptotic properties of the algorithm.
For this discussion, assume that the condition number C is large, say, C >> 10, and that the space dimension is n > 2, possibly large. The variables will be loosely classified as light, medium and heavy, associated respectively with small, medium and large eigenvalues.

In a typical step, g^+_i = (1 − λd_i) g_i, λ = 1/μ. Let us comment on step sizes.

Short steps: safe and frequent. We call short a step with μ ≥ d_n/2. Short steps are:
- Harmless: all |g_i| decrease.
- Efficient for heavy variables.
- Inefficient for light variables: if λ << 1/d_i, then |g^+_i|/|g_i| = 1 − λd_i is near 1.

We conclude that short steps reduce heavy variables, with little effect on light variables. About one half of the steps will be short.

Large steps: dangerous but needed. We call large a step with μ < d_n/2. Large steps are:
- Dangerous: the heavy variables (|g_n|) increase.
- Reasonably efficient for light variables. Very light variables (d_i near d_1) can only be reduced by very large steps. For example, if d_1 = 0.001, d_n = 1, a step λ = 500 >> 2/d_n produces g^+_1 = g_1/2 and g^+_n = −499 g_n.

We conclude that only very large steps are efficient for reducing |g_1|, and they increase very much the heavy variables. Medium steps are also inefficient for the light variables.

Cauchy steps: safe and essential. Cauchy steps satisfy μ ∈ (d_1, d_n), and hence with g^+_i = (1 − d_i/μ) g_i, g_1 always decreases in absolute value, keeping its sign unchanged; g_n changes sign at all iterations, and increases in absolute value if the step is large. The Cauchy algorithm generates a sequence of steps λ_k = 1/μ_k that oscillate, with (μ_{2k}, μ_{2k+1}) converging to the two limit points μ, μ', with μ + μ' = d_1 + d_n. Then (say) μ < (d_1 + d_n)/2 and μ' > (d_1 + d_n)/2 > d_n/2. So, for large k, the steps alternate between short and large. Cauchy steps are in general medium or short, i.e., they are not very large.

The sequence of normalized gradients (y_k) zig-zags, so that the pairs of consecutive iterates (y_{2k}, y_{2k+1}) converge to a pair of vectors (r, r') in L(e_1, e_n). In the original space, this is the subspace generated by the eigenvectors associated with the two extreme eigenvalues d_1, d_n. The step lengths also zig-zag, so that (λ_{2k}, λ_{2k+1}) → (1/μ, 1/μ'), or equivalently (μ_{2k}, μ_{2k+1}) → (μ, μ'), with μ + μ' = d_1 + d_n.
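These effects can be checked directly from the factor 1 − λd_i applied to each gradient component. The numeric sketch below simply evaluates this factor for the d_1 = 0.001, d_n = 1 example above:

```python
# One gradient step multiplies component i by (1 - lam * d_i).
d1, dn = 0.001, 1.0

def factor(lam, di):
    """Multiplier applied to g_i by a gradient step of length lam."""
    return 1.0 - lam * di

# Short step (lam <= 2/dn): harmless, strongly reduces the heavy component.
print(factor(1.0, dn))    # 0.0   -> heavy component annihilated
print(factor(1.0, d1))    # 0.999 -> light component barely moves

# Very large step (lam = 500 >> 2/dn): halves g_1 but blows up g_n.
print(factor(500.0, d1))  # 0.5
print(factor(500.0, dn))  # -499.0
```

The sign change in the last value is the mechanism behind the zig-zag: large steps flip and amplify the heavy components.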
For large, the absolute values of all gradient components must decrease in each pair of iterations. In fact, g never changes sign and decreases in absolute value in all iterations; g n changes size in all iterations, while g n oscillates around g for large. Let us follow the iterates of the Cauchy algorithm for an example with 50 variables, with eigenvalues 0.0 d i, xi 0 = / d i. Here are some numerical values for this example: µ min = 0.48, µ = 0.467, µ = 0.543, µ max = α = 0.922, r n /r = 0.926, r n/r =.080. The value α in item (v) deserves some attention: if C is large and δ is small (there exists an eigenvalue near d), then α > 8d = 8/C, very near the worst case convergence rate nown for the Cauchy algorithm, given by ((C )/(C + )) 2 4/C, counting pairs of iterations. The figures show the values of g i, with the eigenvalues in the horizontal axis. Each figure shows three iterates g, g +, g +2, with vertical lines at the points µ, µ + for = 0, = 5 and = 2. One sees that the gradient components associated with eigenvalues near µ are substantially reduced in the iteration. The last figure reproduces the third one using a logarithmic scale in the horizontal axis. In the beginning iterations the heavy variables are reduced (in absolute value), and soon the oscillatory pattern is achieved. The values of µ, µ + converge to the values µ, µ as in the theorem,

Figure 1: Absolute values of gradient components in consecutive iterations (panels: iterations 0, 1, 2; iterations 5, 6, 7; iterations 12, 13, 14; and the same in logarithmic scale), showing the values of μ_k = 1/λ_k.

the medium gradient components are reduced and |g_n| oscillates around |g_1|, being slowly reduced at each pair of iterations. In the last figure we see that the algorithm affects the medium and heavy variables, with little effect on the light variables. Large steps (small values of μ) are needed to reduce the very light components, and they never happen. Fig. 2 plots the evolution of μ_k (left) and of |g_1| and |g_n|, showing the oscillatory behavior. The step lengths stabilize very fast. As |g^0_n| >> |g^0_1|, |g_n| decreases (in an oscillatory fashion) in the beginning iterations to match |g_1|.

Figure 2: Behavior of λ_k and of |g_1|, |g_n| (panels: step sizes; first and last components of g). λ_k oscillates, |g_1| decreases and |g_n| oscillates.
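The two limit points of the inverse steps are easy to observe numerically. The small pure-Python experiment below is our own illustration, with a uniformly spaced spectrum on [0.01, 1] (not necessarily the exact data of the figures); it checks that, for large k, consecutive inverse steps μ_k sum to approximately d_1 + d_n, as in Theorem 1(i):

```python
# Run the Cauchy algorithm and record mu_k = (g^T D g)/(g^T g):
# for large k, mu_k + mu_{k+1} approaches d_1 + d_n (Theorem 1(i)).
n = 50
d = [0.01 + 0.99 * i / (n - 1) for i in range(n)]   # d_1 = 0.01, d_n = 1
x = [1.0 / di ** 0.5 for di in d]

mus = []
for _ in range(400):
    g = [di * xi for di, xi in zip(d, x)]           # gradient of (1/2) x^T D x
    mu = sum(di * gi * gi for di, gi in zip(d, g)) / sum(gi * gi for gi in g)
    mus.append(mu)
    x = [xi - (1.0 / mu) * gi for xi, gi in zip(x, g)]

# Late consecutive pair: mu + mu' should be close to d_1 + d_n = 1.01.
print(abs(mus[-2] + mus[-1] - (d[0] + d[-1])) < 1e-3)
```

Printing a few late entries of `mus` also shows the alternation between a value below (d_1 + d_n)/2 and one above it.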

2.2 The Akaike-Forsythe results: proof of the main theorem

The results in this subsection are a simplified version of those in Forsythe [9], reduced to the case studied in this paper. We follow his proofs almost step by step, hopefully simplifying the treatment. The study is simplified by the introduction of the following sequence:

(w_k): defined by w_0 = g_0, w_{k+1} = (μ_k I − D) w_k.

It is easy to see that w_k = g_k ∏_{i=0}^{k−1} μ_i. The sequence (w_k) does not seem to have much interest in itself, but as we shall see it is very handy for the study of each iteration of the algorithm. Note that the normalized gradients satisfy y_k = g_k/||g_k|| = w_k/||w_k||.

Notation. We are mostly interested in the sequence of gradient vectors. When studying an iteration, we use the following maps, defined for a vector g ∈ IR^n, g ≠ 0:

μ(g) = (g^T D g)/(g^T g) ∈ [d_1, d_n],  g^+(g) = (I − D/μ(g)) g,  y^+(g) = g^+(g)/||g^+(g)||.  (4)

Whenever no confusion is possible, we omit the argument. If g^+ ≠ 0, we also define μ^+ = μ(g^+) and g^{++}. The fact that μ ∈ [d_1, d_n] is a standard result in linear algebra. If g has at least two non-zero components, then μ(g) ∈ (d_1, d_n). When dealing with the sequence (w_k), we use the following notation:

w^+(w) = (μ(w) I − D) w,  w^{++}(w) = (μ^+ I − D) w^+(w),  (5)

when w^+ ≠ 0. Note that μ(g_k) = μ(w_k) = μ(y_k). By construction of the Cauchy steps, g_k^T g_{k+1} = w_k^T w_{k+1} = y_k^T y_{k+1} = 0.

Let us begin by isolating a special case (and later prove that it never happens):

Lemma 1. Suppose that w = α e_p for some p ∈ {1, 2, ..., n}, α ≠ 0. Then

μ(w) = d_p,  w^+(w) = 0,  lim_{v→w} ||w^+(v)||/||v|| = 0.

Proof. Clearly, μ(w) = e_p^T D e_p = d_p, and then w^+(w) = α(d_p I − D) e_p = 0, because D e_p = d_p e_p. The last result follows from the continuity of μ(·) (and hence of w^+(·)) at w ≠ 0.

Lemma 2. Let p < q be the first and last non-zero components of a vector w ∈ IR^n. Then μ(w) ∈ (d_p, d_q) and w^+_p ≠ 0, w^+_q ≠ 0. For the Cauchy algorithm, if g^0_1 ≠ 0 and g^0_n ≠ 0, then the same is true for all w_k and g_k.

Proof. The result follows from (4)
and the lemma above: μ(w) = d_p (or μ(w) = d_q) can only occur if w_p (or w_q) is its only non-zero component. The result on the sequence (g_k) follows by induction.

From the lemmas above we see that the behavior of the Cauchy algorithm is anomalous at points g on the coordinate axes. Let us call such vectors scalar, and eliminate them by defining the sets

Ω = {g ∈ IR^n | g is not scalar},  Z = {y ∈ Ω | ||y|| = 1}.

Continuous maps: now all the maps defined above (μ, μ^+, g^+, g^{++}, ...) are continuous on Ω. We use the simplified notation μ = μ(w), .... We assume that the Cauchy algorithm is applied to the problem (P) from a point g_0 ∈ Ω with no null component.

Theorem 2. Let (w_k) be generated by the Cauchy algorithm. Then for k = 0, 1, ...,

||w_{k+1}||/||w_k|| = cos(ψ_k) ||w_{k+2}||/||w_{k+1}||,

where ψ_k is the angle between w_k and w_{k+2}.

Proof. Using our simplified notation for w = w_k, we first prove that ||w^+||² = w^T w^{++}. We have

w^T w^{++} = w^T (μI − D)(μ^+ I − D) w,
||w^+||² = w^T (μI − D)(μI − D) w.

Subtracting, we obtain

w^T w^{++} − ||w^+||² = w^T (μI − D)(μ^+ − μ) w = (μ^+ − μ) w^T w^+ = 0.

Now, using this and Cauchy-Schwarz,

w^T w^{++} = ||w|| ||w^{++}|| cos(ψ) = ||w^+||²,

which divided by ||w|| ||w^+|| gives the desired result, completing the proof.

Theorem 3. Let (w_k) be generated by the Cauchy algorithm from g_0 ∈ Ω, and let ψ_k be the angle between w_k and w_{k+2}. Then:

(i) The sequence (φ_k), with φ_k = ||w_{k+1}||/||w_k||, increases towards a limit L > 0.
(ii) lim ψ_k = 0 and lim ||y_k − y_{k+2}|| = 0.
(iii) All limit points of (y_k) are in Z.

Proof. (i) The sequence increases as a consequence of Theorem 2. Let us show that it is bounded: given k ∈ IN, w_{k+1} = (μ_k I − D) w_k, and then ||w_{k+1}|| ≤ ||μ_k I − D|| ||w_k||. We know that μ_k ∈ (d_1, d_n), and hence

||μ_k I − D|| = max{|μ_k − d_i| : i = 1, 2, ..., n} ≤ d_n − d_1,  φ_k = ||w_{k+1}||/||w_k|| ≤ d_n − d_1.

Thus the sequence is increasing and bounded, implying that it converges to some value L.
(ii) From Theorem 2, φ_k = cos(ψ_k) φ_{k+1}, and taking limits, cos(ψ_k) = φ_k/φ_{k+1} → 1, showing that ψ_k → 0. As ||y_k|| = ||y_{k+2}|| = 1 and ψ_k is the angle between these vectors, (ii) holds.
(iii) Assume that y_k →_K r, with K ⊂ IN. If r ∉ Z, then necessarily r = ±e_p for some p = 1, ..., n. Then, by Lemma 1,

lim_{y→r} ||w^+(y)|| = 0.

This contradicts (i), because φ_k = ||w_{k+1}||/||w_k|| = ||w^+(y_k)||/||y_k||, which converges to L > 0, completing the proof.

Summing up, the algorithm generates a sequence (y_k) with y_k = g_k/||g_k|| = w_k/||w_k||, whose limit points are all in the set Z. Then all transformations like μ(y), w^+(y), ... are continuous at any limit point r of (y_k).
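Theorem 3(i) can also be tested numerically. The sketch below (our own illustration, with invented data) iterates w_{k+1} = (μ_k I − D) w_k and checks that φ_k = ||w_{k+1}||/||w_k|| is nondecreasing and bounded by d_n − d_1:

```python
# The auxiliary sequence w_{k+1} = (mu_k I - D) w_k and the ratios
# phi_k = ||w_{k+1}|| / ||w_k||, which should increase towards L.
d = [0.05, 0.2, 0.4, 0.7, 1.0]
w = [1.0] * len(d)

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

phis = []
for _ in range(60):
    mu = sum(di * wi * wi for di, wi in zip(d, w)) / sum(wi * wi for wi in w)
    w_next = [(mu - di) * wi for di, wi in zip(d, w)]
    phis.append(norm(w_next) / norm(w))
    w = w_next

increasing = all(phis[k] <= phis[k + 1] + 1e-12 for k in range(len(phis) - 1))
print(increasing, phis[-1] <= (d[-1] - d[0]) + 1e-12)
```

The small tolerance only guards against floating point rounding; in exact arithmetic the monotonicity is strict until convergence.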

For instance, assuming that y_k →_K r, and defining

r̄ = w^+(r) = (μ(r) I − D) r,  r̄' = w^{++}(r) = (μ^+(r) I − D) r̄,  r^+ = r̄/||r̄||,  r^{++} = r̄'/||r̄'||,

we deduce that y_{k+1} →_K r^+ and y_{k+2} →_K r^{++} = r. This last equality follows from Theorem 3. Also from this theorem, we see that r̄' = L² r.

Remarks. There should be no confusion: given r ∈ Z with unit norm, r̄ = (μI − D)r has ||r̄|| = L, and r^+ has unit norm. Also r^{++} = r. The sequence (y_{2k}) satisfies ||y_{2k+2} − y_{2k}|| → 0. It is then easy to prove that either it is convergent or it has no isolated limit point (actually, the set of limit points is a continuum).

Theorem 4. Let (y_k) be the sequence of normalized gradients generated by the Cauchy algorithm from y_0 = g_0/||g_0||. Then (y_{2k}) converges to a non-scalar point r ∈ L(e_1, e_n), the linear space generated by the basis vectors e_1 and e_n.

Proof. We already know that any limit point of (y_k) is non-scalar. Let r be a limit point of (y_{2k}). The proof will be done in three steps.

(i) There exist two indices p < q such that r, r^+ ∈ L(e_p, e_q).

Let r be a limit point of (y_{2k}), and define the vectors r̄, r̄' as above. We have

r̄'_i = (μ − d_i)(μ^+ − d_i) r_i = L² r_i,  i = 1, ..., n,

where μ, μ^+ are fixed. For each i, either r_i = 0 or (μ − d_i)(μ^+ − d_i) = L². This is a second degree equation in d_i, which will have two real solutions, say, d_i = d_p and d_i = d_q, because Theorem 3(iii) prevents a single solution. If r ∈ L(e_p, e_q), then trivially r̄ = (μI − D)r ∈ L(e_p, e_q).

(ii) r is the unique limit point of (y_{2k}).

Assume by contradiction that r is not a unique limit point, and hence a non-isolated limit point of (y_{2k}). Then there exist an infinite number of limit points v satisfying ||v − r|| < min{|r_p|, |r_q|}. All such points must belong to L(e_p, e_q), for the following reason: any limit point v must belong to some space L(e_s, e_t), where s, t are indices as in (i). If {s, t} ≠ {p, q}, then either v_p = 0 or v_q = 0, and then ||v − r|| ≥ min{|r_p|, |r_q|}. Let us examine a point r ∈ L(e_p, e_q).
We have

μ = r_p² d_p + r_q² d_q,

and immediately,

μ − d_p = r_q² (d_q − d_p),  μ − d_q = r_p² (d_p − d_q).

We also know that ||r̄|| = L, where

r̄_p = (μ − d_p) r_p,  r̄_q = (μ − d_q) r_q.

So,

L² = r̄_p² + r̄_q² = (d_q − d_p)² (r_q⁴ r_p² + r_p⁴ r_q²) = (d_q − d_p)² r_p² r_q² (r_q² + r_p²) = (d_q − d_p)² r_p² r_q².

Hence, we have the equations

r_p² r_q² = L²/(d_q − d_p)²,  r_p² + r_q² = 1,

a system with a finite number of isolated solutions, establishing a contradiction and proving (ii).

(iii) r ∈ L(e_1, e_n).

Let us prove that q = n. The proof that p = 1 is similar. Assume by contradiction that q < n. We know that both r and r^+ belong to L(e_p, e_q), and by Lemma 2, μ(r) < d_q < d_n and μ(r^+) < d_q. Since r, r^+ are the unique limit points of the sequence (y_k), for k sufficiently large, say, k ≥ k̄, μ(y_k) < d_q. For such k,

|g^{k+1}_q|/|g^k_q| = |1 − d_q/μ(y_k)| < |1 − d_n/μ(y_k)| = |g^{k+1}_n|/|g^k_n|.

Hence for all k > k̄,

|g^k_q|/|g^k_n| < |g^k̄_q|/|g^k̄_n|,

contradicting the fact that y^k_n → 0 and y^k_q → r_q ≠ 0.

2.3 Properties of the limiting points

Now we know that the study of asymptotic properties of the sequences generated by the Cauchy algorithm may be reduced to the study of the limit points. This leads to simple results that reproduce properties described by Nocedal, Sartenaer and Zhu [11] and by De Asmundis et al. [6, 7]. Consider an application of the Cauchy algorithm as above, starting from x_0 with no null component, and define r = lim y_{2k}, with y_k = g_k/||g_k||. We know that the only non-zero components of r are r_1 and r_n. From the analysis above, writing r̄ = (μI − D)r, we know that

(μ'I − D)(μI − D) r = L² r,  (6)
μ = r_1² d_1 + r_n² d_n,  λ = 1/μ,  (7)
||r̄|| = L,  (8)
μ' = (r̄^T D r̄)/||r̄||²,  λ' = 1/μ',  (9)
(μ'I − D) r̄ = L² r,  ||(μ'I − D) r̄|| = L².  (10)

Lemma 3. μ + μ' = d_1 + d_n.

Proof. As (μI − D)(μ'I − D) r = L² r and r_1, r_n ≠ 0,

L² = (μ − d_1)(μ' − d_1) = (μ − d_n)(μ' − d_n).

Then

μμ' − d_1(μ + μ') + d_1² = μμ' − d_n(μ + μ') + d_n²,  (d_n − d_1)(μ + μ') = d_n² − d_1².

The result follows by dividing by (d_n − d_1), completing the proof.

This completes the proofs of items (i)-(iii) of Theorem 1. Let us prove the remaining items, studying a double iteration for large k:

lim_k g^{k+2}_i/g^k_i = (1 − d_i/μ)(1 − d_i/μ') ≤ (1 − d_n/μ)(1 − d_n/μ'),  (11)

because d_i ≤ d_n. As g^k_i → 0, we must have lim_k |g^{k+2}_i/g^k_i| ≤ 1,

proving (iv). In view of the inequality above, this is reduced to

(1 − d_i/μ)(1 − d_i/μ') > −1,  i = 1, ..., n.  (12)

Developing (12), we obtain immediately

2μμ' > (μ + μ') d_i − d_i² = (d_1 + d_n) d_i − d_i²,  μ + μ' = d_1 + d_n.  (13)

Let us define d̄ = (d_1 + d_n)/2, δ_i = d̄ − d_i for i = 1, ..., n, and δ = min{|δ_i| : i = 1, ..., n}. It is straightforward to check that the right hand side of (13) satisfies

(d_1 + d_n) d_i − d_i² = d̄² − δ_i²,

and hence we know that μ, μ' must satisfy

μμ' ≥ (d̄² − δ_i²)/2,  μ + μ' = 2d̄,  i = 1, ..., n,

which is equivalent to

μμ' ≥ (d̄² − δ²)/2,  μ + μ' = 2d̄.  (14)

The components g_1, g_n are reduced in each double iteration by α = (1 − d_n/μ)(1 − d_n/μ'). Developing this expression and substituting μ + μ' = d_1 + d_n we obtain

α = 1 − d_1 d_n/(μμ').  (15)

From (14) and this expression we conclude that

α ≥ 1 − 2 d_1 d_n/(d̄² − δ²),  (16)

proving item (v) of Theorem 1. Bounds for the inverse steps μ, μ' are obtained by solving the system (14), which reduces to a simple second degree equation whose solutions are

μ, μ' = d̄ ± √((d̄² + δ²)/2),  (17)

and thus μ_min ≤ μ, μ' ≤ μ_max, proving item (vi). The last item of the main theorem is proved in the following lemma:

Lemma 4. There exists c > 0 such that

c = r_n/r_1 = −r'_1/r'_n,  with c² = (μ − d_1)/(d_n − μ),

where r' = lim y_{2k+1}.

Proof. By construction of the Cauchy step, r ⊥ r̄, i.e., r_1 r̄_1 + r_n r̄_n = 0. The first equalities follow by dividing this by r_1 r̄_n. Using this result, consider r^+ = (I − D/μ) r. We have r^+_n = (1 − d_n/μ) r_n, r^+_1 = (1 − d_1/μ) r_1, and so

(r^+_n/r^+_1)/(r_n/r_1) = (1 − d_n/μ)/(1 − d_1/μ) = −(d_n − μ)/(μ − d_1).

From this last equality, it follows that

c² = r_n²/r_1² = (μ − d_1)/(d_n − μ) = (μ − d_1)/(μ' − d_1),

using the fact that μ + μ' = d_1 + d_n, completing the proof.

Some interesting relations proved in [7] are now straightforward: from r_n = c r_1 and r_1² + r_n² = 1 we obtain

r_1² = 1/(1 + c²),  r_n² = c²/(1 + c²),  (18)
μ = (d_1 + c² d_n)/(1 + c²) = d_1 (1 + c² C)/(1 + c²),  (19)
μ' = (c² d_1 + d_n)/(1 + c²) = d_1 (c² + C)/(1 + c²).  (20)

Let us now relate the constants L and c. We have L² = (μ − d_1)(μ' − d_1), with

μ − d_1 = d_1 (1 + c² C)/(1 + c²) − d_1 = c² d_1 (C − 1)/(1 + c²),
μ' − d_1 = d_1 (c² + C)/(1 + c²) − d_1 = d_1 (C − 1)/(1 + c²).

Simplifying these expressions and using Lemma 3 we obtain

L² = c² (d_n − d_1)²/(1 + c²)² = μμ' − d_1 d_n.  (21)

3 Breaking the cycle

The Cauchy algorithm reduces the medium variables and acts very slowly on the light and heavy variables. The cycle may be broken by enforcing either a very short (μ near d_n) or a very large (μ near d_1) step. It is difficult to estimate the value d_1, and a very large step will cause a great increase in the heavy variables (in absolute value), and will possibly increase the function value. A very large step should only be allowed if it is obtained by a Cauchy iterate, which always reduces the function value. A very short step is harmless, and reduces the heavy variables. If the medium variables are already small, the function will then be dominated by the light variables, and the next Cauchy step will be large, breaking the cycle.

This is done by De Asmundis et al. [7]: if μ_k, μ_{k+1} are near μ, μ', then μ_k + μ_{k+1} ≈ d_1 + d_n, and then the step 1/(μ_k + μ_{k+1}) will be short. They check the evolution of μ_k and decide to periodically estimate and apply short steps by this method. Their algorithm was named steepest descent with alignment (SDA). We shall follow their approach, but with a different way of estimating short steps: all steps will be Cauchy steps taken at some point, making sure that all iterations satisfy μ_k ∈ (d_1, d_n). In the algorithms below we call λ-gradient step a steepest descent step with step length λ.
The algorithm below is the Cauchy algorithm modified by periodically applying short steps.

Stopping rule. Each step must be followed by the statement k = k + 1 and by testing a stopping rule. The usual test is ||g_k|| ≤ ε, for a given ε > 0. In our examples with diagonal Hessian matrices we test the function values, because the optimal value is zero.

Algorithm 1. Cauchy-short algorithm.
Data: x_0 ∈ IR^n, K_I, K_C, K_s ∈ IN (respectively 10, 6, 2 in our choice), k = 0.
Take K_I Cauchy steps.
repeat
    Compute an estimated short step λ_s.
    Take K_s λ_s-gradient steps.
    Take K_C Cauchy steps.

Remark. The choice of K_I, K_C, K_s is somewhat arbitrary, but we noticed in many tests that the oscillatory pattern is achieved in about 10 Cauchy iterations initially, and in about 6 Cauchy iterations after each set of short steps. Two consecutive short steps are sufficient to cause a substantial reduction in the large variables. See Fig. 3 for an example. Note that even if the estimated short step is not very near 1/d_n, the short steps are harmless.

Estimating a short step: a mock large step. We intend to state an algorithm in which all iterations take Cauchy steps, and so all steps will satisfy d_1 < μ < d_n. As the cycle pattern is established, an artificially very large step would cause a great increase in the heavy variables, and consequently the next Cauchy step would be short. The very large step will be computed but not applied. Let S be a very large step length (S >> 1/d_1). At an iteration k we compute the following short step:

ḡ = (I − S D) g_k,  (22)
λ_s = (ḡ^T ḡ)/(ḡ^T D ḡ).  (23)

The reason why this will be a short step is the following: for a large value of S,

ḡ_i = (1 − S d_i) g^k_i ≈ −S d_i g^k_i,  i = 1, ..., n.

Hence the following Cauchy step will satisfy

λ_s = (ḡ^T ḡ)/(ḡ^T D ḡ) ≈ (Σ_{i=1}^{n−1} (g^k_i)² d_i² + (g^k_n)² d_n²) / (Σ_{i=1}^{n−1} (g^k_i)² d_i³ + (g^k_n)² d_n³).

If the medium components are small, then for i = 1, ..., n−1, |g^k_i| d_i << |g^k_n| d_n, and then λ_s ≈ 1/d_n. A safeguard may be used to enforce short steps, by setting λ_s = min{λ_s, λ_{k−1}, λ_{k−2}}, as min{λ_{k−1}, λ_{k−2}} should be a short step. Fig. 3 shows iterates before and after the short steps for our example. Fig. 4 shows the behavior of the step lengths for the Cauchy-short algorithm: the cyclic pattern is broken by periodically forcing a short step, which is naturally followed by a large Cauchy step.

Alternating Cauchy and short steps.
It is known that about one half of the steps will be short. The algorithm below uses the computation of short steps as above, but applies them after all Cauchy steps:

Algorithm 2. Alternated Cauchy-short steps.
Data: x_0 ∈ IR^n, K_I, K_C, K_s (respectively 10, 6, 2 in our choice), k = 0.
Take K_I Cauchy steps.
repeat
    Compute an estimated short step λ_s.
    Take K_s λ_s-gradient steps.
    Take K_C iterations composed of a Cauchy step followed by one λ_s-gradient step.
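The short-step estimate (22)-(23) used by both algorithms is cheap to implement. The sketch below is our own illustration of the mock large step (the function name and the data are ours; S is any value much larger than 1/d_1):

```python
# Mock large step: g_bar = (I - S D) g is computed but NOT applied;
# the Cauchy step evaluated at g_bar, as in (22)-(23), is close to
# 1/d_n whenever the medium components of g are already small.

def estimate_short_step(d, g, S=1e6):
    g_bar = [(1.0 - S * di) * gi for di, gi in zip(d, g)]   # (22)
    num = sum(gi * gi for gi in g_bar)                      # g_bar^T g_bar
    den = sum(di * gi * gi for di, gi in zip(d, g_bar))     # g_bar^T D g_bar
    return num / den                                        # (23)

d = [0.001, 0.01, 0.5, 1.0]
g = [1.0, 1.0, 0.01, 1.0]     # medium component already small
lam_s = estimate_short_step(d, g)
print(abs(lam_s - 1.0 / d[-1]) < 0.01)   # lam_s is near 1/d_n = 1
```

In a full implementation the safeguard λ_s = min{λ_s, λ_{k−1}, λ_{k−2}} mentioned above would be applied to the returned value.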

Figure 3: Before and after 2 short steps in the CS algorithm (panels: short step, followed by a large step; following iteration).

Figure 4: Behavior of λ_k and of |g_1|, |g_n| in an application of the CS algorithm (panels: step sizes; first and last components of g).

The algorithm has a stronger effect on the very heavy variables. Consequently the Cauchy and eventually the short steps increase. Fig. 5 shows the behavior of the function values on an example with C = 1000 and n = 1000, eigenvalues uniformly distributed between 0.001 and 1, x^0_i = 1/√d_i. The figure shows the evolution of the Barzilai-Borwein (BB), the Cauchy-short (CS), the steepest descent with alignment (SDA), and the alternated Cauchy-short (ACS) algorithms. It also shows the effect of using Chebyshev roots to correct the step lengths, as described below.

Using Chebyshev roots. The paper [10] has the following result: if bounds l ≤ d_1 and u ≥ d_n are known, then the stopping condition |x^K_i| ≤ ε |x^0_i| for i = 1, ..., n is achieved by the following steepest descent scheme: for k = 0 to K − 1, g_{k+1} = (I − λ_k D) g_k, where, using C = u/l,

K = K(C) = ⌈ cosh⁻¹(1/ε) / cosh⁻¹((C + 1)/(C − 1)) ⌉ ≈ (√C/2) log(2/ε),

and the set of step lengths is Λ = {1/μ_k | k = 0, ..., K − 1}, with

μ_k = ((u − l)/2) cos((2k + 1)π/(2K)) + (u + l)/2,  (24)

Figure 5: Function values for the BB, CS, ACS, SDA and CS with steps corrected to Chebyshev roots using adaptive bounds for the eigenvalues.

taken in any order. If the bounds l and u are exact, this number of iterations coincides almost exactly with the worst case performance of the Krylov space method, the best possible. This means that if good bounds for the eigenvalues are known, then the number of iterations of any gradient algorithm may be limited to K(C) if the step lengths are corrected to match elements of Λ. This is the resulting scheme:

Let C = u/l, compute K(C) and the set Λ by (24). Take any steepest descent method, and execute the following commands after the computation of each step length λ_k:
Set λ_k = argmin{|ν − λ_k| : ν ∈ Λ}.
Remove the element λ_k from the set Λ.

Adaptive scheme. This scheme depends on reliable lower and upper bounds l and u for the eigenvalues. When they are not available, we construct them by adding the following procedure to Algorithms 1 and 2:

Initialization: choose an integer k_0 > K_I + K_s, take the first k_0 iterations of the algorithm and set

u = 1.2 max_{k=0,...,k_0} 1/λ_k,  l = 0.25 min_{k=0,...,k_0} 1/λ_k,  C = u/l,

and compute Λ by (24).

Adaptation: at all steps with k > k_0, perform the following adaptation of the set Λ:
If u < 1/λ_k, set u = 1.2 u, C = u/l and reset Λ by (24).
If l > 1/λ_k, set l = l/4, C = u/l and reset Λ by (24).

The initial value of u will hopefully satisfy u > d_n, because the short step computation is efficient, both using our algorithms and SDA (with k_0 properly chosen). The lower bound will probably be updated, and each time l changes the size of Λ doubles. We applied this scheme to our test problems, using the Cauchy-short algorithm. The result is shown in Fig. 5.
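Under the stated assumptions (bounds l ≤ d_1 and u ≥ d_n), the construction of the set Λ in (24) and the snap-to-nearest correction can be sketched as follows. This is our own illustration; `chebyshev_steps` and `snap` are invented names, and the bound data is hypothetical:

```python
import math

def chebyshev_steps(l, u, eps=1e-6):
    """Inverse Chebyshev roots on [l, u]: the step set Lambda of (24)."""
    C = u / l
    K = math.ceil(math.acosh(1.0 / eps) / math.acosh((C + 1.0) / (C - 1.0)))
    mus = [0.5 * (u - l) * math.cos((2 * k + 1) * math.pi / (2 * K))
           + 0.5 * (u + l) for k in range(K)]
    return [1.0 / mu for mu in mus]

def snap(lam, pool):
    """Replace lam by the nearest still-unused element of Lambda."""
    best = min(pool, key=lambda nu: abs(nu - lam))
    pool.remove(best)
    return best

# l = 0.001, u = 1 gives C = 1000; K is roughly (sqrt(C)/2) * log(2/eps).
steps = chebyshev_steps(0.001, 1.0)
print(len(steps), all(1.0 < s < 1000.0 for s in steps))
```

In the adaptive scheme, `chebyshev_steps` would simply be called again with the updated l, u whenever a computed step falls outside [1/u, 1/l].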
This scheme is only applicable to quadratic functions, but it reduces all variables to |g^K_i| ≤ ε |g^0_i|, which may be important in the solution of linear systems of equations and least squares problems.

Performance profile. Finally, Fig. 7 is a performance profile, computed as described in [8], using 20 quadratic problems with 1000 variables, three values of the condition number C = 1000, 10000, 100000, and each problem with a different randomly generated initial point. The eigenvalues are randomly generated with four different distributions, exemplified by Fig. 6. It shows that all new schemes are efficient, and seem to be superior to the Barzilai-Borwein method, with the advantage of generating monotonically decreasing function values.

Figure 6: Eigenvalue distributions (panels: uniform, logarithmic, sinusoidal and blocks distributions): a point (i, s) means that d_i = s d_n.

Figure 7: Performance profile for the algorithms Barzilai-Borwein, Algorithm 1 (CS), Algorithm 2 (ACS), steepest descent with alignment (SDA) and CS with adaptive computation of Chebyshev steps.

Conclusion. We believe that a good understanding of algorithms for quadratic problems is fundamental for the design of methods for more general problems. The steepest descent method, mainly using approximated Cauchy steps, is the most well known of all methods. Its poor performance, which contradicts the simple intuition of greedily computing the maximum function reduction at each iteration, has always been frustrating. It was explained by Akaike, and only recently a simple cure has been found: just take a short step once in a while. In this paper we hope to have summarized the classical proofs of this behavior, and proposed new procedures for quadratic problems. We did not, for the time being, try to apply similar schemes to non-quadratic functions, but we hope that these ideas may lead to general methods without the need of strategies for dealing with non-monotonicity.

References

[1] H. Akaike. On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Statist. Math. Tokyo, 11:1-17, 1959.
[2] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8:141-148, 1988.
[3] E. G. Birgin, J. M. Martínez, and M. Raydan. Spectral projected gradient methods. In C. A. Floudas and P. M. Pardalos, editors, Encyclopedia of Optimization. Springer, 2009.
[4] A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Acad. Sci. Paris, 25:536-538, 1847.
[5] Y. H. Dai. Alternate step gradient method. Optimization, 52(4-5):395-415, 2003.
[6] R. De Asmundis, D. di Serafino, W. Hager, G. Toraldo, and H. Zhang. An efficient gradient method using the Yuan steplength. Technical report, Sapienza University of Rome, Italy, 2014.
[7] R. De Asmundis, D. di Serafino, R. Riccio, and G. Toraldo. On spectral properties of steepest descent methods. IMA J. Numer. Anal., 33:1416-1435, 2013.
[8] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201-213, 2002.
[9] G. E. Forsythe. On the asymptotic directions of the s-dimensional optimum gradient method. Numerische Mathematik, 11:57-76, 1968.
[10] C. C. Gonzaga. Optimal performance of the steepest descent algorithm for quadratic functions. Technical report, Federal University of Santa Catarina, Florianópolis, Brazil, 2014.
[11] J. Nocedal, A. Sartenaer, and C. Zhu. On the behavior of the gradient norm in the steepest descent method. Computational Optimization and Applications, 22:5-35, 2002.
[12] M. Raydan. The Barzilai and Borwein gradient method for large scale unconstrained minimization problem. SIAM Journal on Optimization, 7:26-33, 1997.
[13] M. Raydan and B. F. Svaiter. Relaxed steepest descent and Cauchy-Barzilai-Borwein method. Computational Optimization and Applications, 21:155-167, 2002.


More information

MATH 1120 (LINEAR ALGEBRA 1), FINAL EXAM FALL 2011 SOLUTIONS TO PRACTICE VERSION

MATH 1120 (LINEAR ALGEBRA 1), FINAL EXAM FALL 2011 SOLUTIONS TO PRACTICE VERSION MATH (LINEAR ALGEBRA ) FINAL EXAM FALL SOLUTIONS TO PRACTICE VERSION Problem (a) For each matrix below (i) find a basis for its column space (ii) find a basis for its row space (iii) determine whether

More information

The speed of Shor s R-algorithm

The speed of Shor s R-algorithm IMA Journal of Numerical Analysis 2008) 28, 711 720 doi:10.1093/imanum/drn008 Advance Access publication on September 12, 2008 The speed of Shor s R-algorithm J. V. BURKE Department of Mathematics, University

More information

An Efficient Modification of Nonlinear Conjugate Gradient Method

An Efficient Modification of Nonlinear Conjugate Gradient Method Malaysian Journal of Mathematical Sciences 10(S) March : 167-178 (2016) Special Issue: he 10th IM-G International Conference on Mathematics, Statistics and its Applications 2014 (ICMSA 2014) MALAYSIAN

More information

Chapter 10 Conjugate Direction Methods

Chapter 10 Conjugate Direction Methods Chapter 10 Conjugate Direction Methods An Introduction to Optimization Spring, 2012 1 Wei-Ta Chu 2012/4/13 Introduction Conjugate direction methods can be viewed as being intermediate between the method

More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information