Gradient methods exploiting spectral properties

Noname manuscript No. (will be inserted by the editor)

Gradient methods exploiting spectral properties

Yakui Huang · Yu-Hong Dai · Xin-Wei Liu

Received: date / Accepted: date

Abstract A new stepsize is derived for the gradient method. It is shown that this stepsize asymptotically converges to the reciprocal of the largest eigenvalue of the Hessian of the objective function. Based on this spectral property, we develop a monotone gradient method that takes a certain number of steps with the asymptotic optimal stepsize proposed by Dai and Yang (Computational Optimization and Applications, 2006, 33(1): 73-88), followed by some short steps associated with the new stepsize. By employing a one-step retard of the asymptotic optimal stepsize, a nonmonotone variant is suggested. R-linear convergence is established for the proposed methods. Numerical comparisons with other recent successful methods demonstrate the efficiency of our methods.

Keywords gradient method · spectral property · convergence · quadratic optimization

This work was supported by the National Natural Science Foundation of China (Grant Nos. , , , ) and the National 973 Program of China (Grant No. 2015CB856002).

Yakui Huang, Institute of Mathematics, Hebei University of Technology, Tianjin, China. huangyakui2006@gmail.com

Yu-Hong Dai, State Key Laboratory of Scientific and Engineering Computing, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. dyh@lsec.cc.ac.cn

Xin-Wei Liu, Institute of Mathematics, Hebei University of Technology, Tianjin, China. mathlxw@hebut.edu.cn

1 Introduction

We consider the problem of minimizing a convex quadratic function

    min f(x) = (1/2) x^T A x − b^T x,    (1)

where b ∈ R^n and A ∈ R^{n×n} is symmetric positive definite with eigenvalues 0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_n and condition number κ = λ_n/λ_1. This problem is one of the simplest non-trivial nonlinear programming problems. Solving problem (1) is usually a prerequisite for a method to be generalized to solve more general problems. In addition, various optimization problems arising in many applications, including machine learning [6], sparse reconstruction [18] and nonnegative matrix factorization [25,27], can be formulated in the form of (1), possibly with the addition of regularization or bound constraints.

The simplest method for solving (1) is the gradient method, which updates iterates by

    x_{k+1} = x_k − α_k g_k,    (2)

where g_k = ∇f(x_k) and α_k > 0 is the stepsize determined in some way. The classic steepest descent (SD) method can be dated back to Cauchy [5], who suggested computing the stepsize by exact line search:

    α_k^{SD} = arg min_α f(x_k − α g_k) = g_k^T g_k / (g_k^T A g_k).    (3)

It has been shown that the method converges linearly [1] with Q-linear factor (κ − 1)/(κ + 1). Thus, the SD method can be very slow, especially when the condition number is large. Further analysis shows that the gradients will asymptotically produce zigzags between two orthogonal directions associated with the two eigenvectors corresponding to λ_1 and λ_n; see [20,29] for more details.

In 1988, from the point of view of quasi-Newton methods, Barzilai and Borwein [2] designed the following two ingenious stepsizes that significantly improve the performance of gradient methods:

    α_k^{BB1} = s_{k−1}^T s_{k−1} / (s_{k−1}^T y_{k−1}),   α_k^{BB2} = s_{k−1}^T y_{k−1} / (y_{k−1}^T y_{k−1}),    (4)

where s_{k−1} = x_k − x_{k−1} and y_{k−1} = g_k − g_{k−1}. The BB method was shown to be globally convergent for general n-dimensional strictly convex quadratics [30] and the convergence rate is R-linear [11]. Recently, the BB method and its variants have successfully been extended to general unconstrained problems [31], to constrained optimization problems [3,24] and to various applications [25,26,28]; see also [4,10,16,19,33] and references therein.
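For readers who prefer code, the update rule (2) and the stepsizes (3) and (4) can be written down directly. The following Python/NumPy sketch is an illustration added in this transcription (it is not the authors' Matlab code, and the function names are ours):

    import numpy as np

    def sd_stepsize(A, g):
        # Cauchy's exact line search stepsize (3): g_k^T g_k / (g_k^T A g_k).
        return g.dot(g) / g.dot(A.dot(g))

    def bb_stepsizes(s, y):
        # Barzilai-Borwein stepsizes (4), with s = x_k - x_{k-1} and y = g_k - g_{k-1}.
        return s.dot(s) / s.dot(y), s.dot(y) / y.dot(y)

    def gradient_method(A, b, x, choose_alpha, tol=1e-6, max_iter=10000):
        # Generic gradient method (2) for the quadratic (1);
        # choose_alpha(k, A, x, g) must return the stepsize alpha_k.
        g = A.dot(x) - b
        g0 = np.linalg.norm(g)
        for k in range(max_iter):
            if np.linalg.norm(g) <= tol * g0:
                break
            x = x - choose_alpha(k, A, x, g) * g
            g = A.dot(x) - b
        return x, k

For example, passing choose_alpha = lambda k, A, x, g: sd_stepsize(A, g) reproduces the SD method.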

Let {ξ_1, ξ_2, ..., ξ_n} be the orthonormal eigenvectors associated with the eigenvalues, and denote the components of g_k along the eigenvectors ξ_i by μ_i^k, i = 1, ..., n, i.e.,

    g_k = Σ_{i=1}^n μ_i^k ξ_i,

which together with the update rule (2) gives that

    g_{k+1} = g_k − α_k A g_k = Π_{j=1}^k (I − α_j A) g_1 = Σ_{i=1}^n μ_i^{k+1} ξ_i,

where

    μ_i^{k+1} = (1 − α_k λ_i) μ_i^k = μ_i^1 Π_{j=1}^k (1 − α_j λ_i).

This implies that the closer α_k is to 1/λ_i, the smaller |μ_i^{k+1}|. In addition, if μ_i^k = 0, the corresponding component vanishes at all subsequent iterations. Since the SD method will asymptotically zigzag between ξ_1 and ξ_n, a natural way to break the zigzagging pattern is eliminating the component μ_1^k or μ_n^k, which can be achieved by employing a stepsize that approximates 1/λ_1 or 1/λ_n.

One pioneering work in this line is due to Yuan [14,32], who derived the following stepsize by imposing finite termination for two-dimensional convex quadratics:

    α_k^Y = 2 / ( sqrt( (1/α_{k−1}^{SD} − 1/α_k^{SD})^2 + 4‖g_k‖^2/(α_{k−1}^{SD} ‖g_{k−1}‖)^2 ) + 1/α_{k−1}^{SD} + 1/α_k^{SD} ).    (5)

Dai and Yuan [14] further suggested a new gradient method whose stepsize is given by

    α_k^{DY} = α_k^{SD},  if mod(k, 4) < 2;   α_k^Y,  otherwise.    (6)

The DY method (6) keeps monotonicity and appears better than the nonmonotone BB method; see [14]. It was shown by De Asmundis et al. [15] that the stepsize α_k^Y converges to 1/λ_n if the SD method is applied to problem (1). That is, occasionally employing the stepsize α_k^Y along the SD method will enhance the elimination of the component μ_n^k. Recently, Gonzaga and Schneider [23] suggested a monotone method where all stepsizes are of the form (3). Particularly, their method approximates 1/λ_n by a short stepsize calculated by replacing g_k in (3) with (I − ηA) g_k, where η is a large number.

Nonmonotone gradient methods exploiting spectral properties have been developed as well. Frassoldati et al. [21] developed a new short stepsize by maximizing the next SD stepsize α_{k+1}^{SD} according to the current stepsize. They further suggested the ABBmin2 method, which tries to exploit BB1 stepsizes close to 1/λ_1 by using short stepsizes to eliminate those components associated with large eigenvalues, thereby forcing the gradient to be dominated by components corresponding to small eigenvalues. More recently, based on the favourable property of α_k^Y, De Asmundis et al. [15] suggested reusing it in a cyclic fashion after a certain number of SD steps. Precisely, their approach, referred to as the SDC method, employs the stepsize

    α_k^{SDC} = α_k^{SD},  if mod(k, h + s) < h;   α_t^Y,  otherwise,    (7)

with t = max{i ≤ k : mod(i, h + s) = h}, where h ≥ 2 and s ≥ 1.
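As a concrete reading of (5) and (6), a small Python helper could look as follows; this is our own illustrative sketch (argument names are ours), not code from the paper:

    import numpy as np

    def yuan_stepsize(a_sd_prev, a_sd, gnorm_prev, gnorm):
        # Yuan stepsize (5), built from two consecutive SD stepsizes and gradient norms.
        diff = 1.0 / a_sd_prev - 1.0 / a_sd
        root = np.sqrt(diff**2 + 4.0 * gnorm**2 / (a_sd_prev * gnorm_prev)**2)
        return 2.0 / (root + 1.0 / a_sd_prev + 1.0 / a_sd)

    def dy_stepsize(k, a_sd_prev, a_sd, gnorm_prev, gnorm):
        # Dai-Yuan rule (6): alternate two SD steps with two Yuan steps.
        if k % 4 < 2:
            return a_sd
        return yuan_stepsize(a_sd_prev, a_sd, gnorm_prev, gnorm)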

They also proposed a monotone version of (7) by imposing safeguards on the stepsizes.

One common characteristic of the aforementioned methods is making use of spectral properties of the stepsizes. The recent study in [16] points out that gradient methods using long and short stepsizes that attempt to exploit their spectral properties show better numerical behaviour both in the quadratic and in the general case. For more works on gradient methods, see [7,9,15,16,21,23,33,34] and references therein.

Recently, Dai and Yang [12] introduced a gradient method with a new stepsize that possesses a similar spectral property as α_k^Y. In particular, their stepsize is calculated by

    α_k^{AOPT} = ‖g_k‖ / ‖A g_k‖,    (8)

which asymptotically converges to 2/(λ_1 + λ_n), the value that is optimal in the sense that it minimizes ‖I − αA‖; see for example [12,17]. Since α_k^{AOPT} ≤ α_k^{SD}, the Dai-Yang method (8) is monotone but without exact line searches. In addition, the method converges Q-linearly and enjoys the same rate as the SD method. More importantly, it is able to recover the eigenvectors ξ_1 and ξ_n.

In this paper, based on the Dai-Yang method (8), we propose a new stepsize to exploit the spectral property. Particularly, our new stepsize is given by

    ᾱ_k = d_k^T d_k / (d_k^T A d_k),    (9)

where

    d_k = g_{k−1}/‖g_{k−1}‖ − g_k/‖g_k‖.    (10)

We show that the stepsize ᾱ_k asymptotically converges to 1/λ_n if the Dai-Yang method (8) is applied to problem (1). Therefore, the stepsize ᾱ_k is helpful in eliminating the component μ_n^k. Thanks to this desired property, we are able to develop a new efficient gradient method by taking a certain number of steps with the asymptotic optimal stepsize α_k^{AOPT} followed by some short steps which are determined by the smaller of α_k^{AOPT} and ᾱ_{k−1}. Thus, our method is monotone without exact line searches. We also consider a nonmonotone variant of the method which is the same as the monotone one except for using the stepsize α_k^{AOPT} with one step retard. R-linear convergence of the proposed methods is established. Preliminary numerical comparisons with the DY, ABBmin2 and SDC methods are presented to demonstrate the efficiency of our methods.

The paper is organized as follows. In Section 2 we analyze the asymptotic spectral property of the stepsize ᾱ_k and present the new method as well as its nonmonotone variant. In Section 3 we show the proposed methods are R-linearly convergent. In Section 4 we report some numerical comparisons of our methods and the DY, ABBmin2 and SDC methods. Finally, a brief discussion is presented.
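The two quantities (8)-(10) are straightforward to compute from two consecutive gradients. The helpers below are our own illustrative sketch (function names are ours), reused conceptually in the later sketches:

    import numpy as np

    def aopt_stepsize(A, g):
        # Dai-Yang asymptotically optimal stepsize (8): ||g_k|| / ||A g_k||.
        return np.linalg.norm(g) / np.linalg.norm(A.dot(g))

    def bar_alpha(A, g_prev, g):
        # New stepsize (9)-(10): inverse Rayleigh quotient of
        # d_k = g_{k-1}/||g_{k-1}|| - g_k/||g_k||.
        d = g_prev / np.linalg.norm(g_prev) - g / np.linalg.norm(g)
        return d.dot(d) / d.dot(A.dot(d))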

2 Proposed methods

In this section we first analyze the spectral property of the stepsize ᾱ_k and then present our new gradient methods.

2.1 Spectral property of ᾱ_k

We recall important properties of the method (8).

Lemma 1 [12] For any starting point x_1 satisfying μ_1^1 ≠ 0 and μ_n^1 ≠ 0, let {x_k} be the iterations generated by the method (8). Then we have that

    lim_{k→∞} α_k^{AOPT} = 2/(λ_1 + λ_n).

Furthermore,

    lim_{k→∞} μ_i^{2k−1} / sqrt( Σ_{j=1}^n (μ_j^{2k−1})^2 ) = { sign(μ_1^1)√c_1, if i = 1;  0, if i = 2, ..., n−1;  sign(μ_n^1)√c_2, if i = n }

and

    lim_{k→∞} μ_i^{2k} / sqrt( Σ_{j=1}^n (μ_j^{2k})^2 ) = { sign(μ_1^1)√c_1, if i = 1;  0, if i = 2, ..., n−1;  −sign(μ_n^1)√c_2, if i = n },

which indicates that

    lim_{k→∞} ( g_{k−1}/‖g_{k−1}‖ + g_k/‖g_k‖ ) = 2 sign(μ_1^1)√c_1 ξ_1

and

    lim_{k→∞} ( g_{k−1}/‖g_{k−1}‖ − g_k/‖g_k‖ ) = ±2√c_2 ξ_n,

where

    c_1 = (3λ_n + λ_1) / (4(λ_1 + λ_n)),   c_2 = (3λ_1 + λ_n) / (4(λ_1 + λ_n)).

Lemma 1 indicates that the method (8) asymptotically reduces its search to the two-dimensional subspace spanned by ξ_1 and ξ_n. In order to accelerate the method, we can employ some stepsize that approximates 1/λ_1 or 1/λ_n to eliminate the component μ_1^k or μ_n^k. Note that the vector d_k tends to align with the eigenvector ξ_n. If we take some gradient steps with α_k^{AOPT} such that d_k ≈ ±2√c_2 ξ_n, the stepsize ᾱ_k will be an approximation of 1/λ_n. The next theorem provides theoretical evidence for this.

Theorem 1 Under the conditions in Lemma 1, let {g_k} be the sequence generated by applying the method (8) to problem (1). Then we have

    lim_{k→∞} ᾱ_k = 1/λ_n.

Proof From the definition (10) of d_k, we have

    d_k^T d_k = 2 − 2 g_{k−1}^T g_k / (‖g_{k−1}‖ ‖g_k‖)    (11)

and

    d_k^T A d_k = g_{k−1}^T A g_{k−1}/‖g_{k−1}‖^2 + g_k^T A g_k/‖g_k‖^2 − 2 g_{k−1}^T A g_k / (‖g_{k−1}‖ ‖g_k‖),    (12)

which, by Lemma 1, indicate that

    lim_{k→∞} d_k^T d_k = 2 − 2 lim_{k→∞} Σ_{i=1}^n [ μ_i^{k−1} / sqrt(Σ_{j=1}^n (μ_j^{k−1})^2) ] [ μ_i^k / sqrt(Σ_{j=1}^n (μ_j^k)^2) ] = 2 − 2(c_1 − c_2)    (13)

and

    lim_{k→∞} d_k^T A d_k = lim_{k→∞} [ Σ_{i=1}^n λ_i (μ_i^{k−1})^2 / Σ_{j=1}^n (μ_j^{k−1})^2 + Σ_{i=1}^n λ_i (μ_i^k)^2 / Σ_{j=1}^n (μ_j^k)^2 − 2 Σ_{i=1}^n λ_i μ_i^{k−1} μ_i^k / (‖g_{k−1}‖ ‖g_k‖) ] = 2(λ_1 c_1 + λ_n c_2) − 2(λ_1 c_1 − λ_n c_2) = 4 λ_n c_2.    (14)

It follows from (13), (14) and the definition (9) of ᾱ_k that

    lim_{k→∞} ᾱ_k = lim_{k→∞} d_k^T d_k / (d_k^T A d_k) = (2 − 2(c_1 − c_2)) / (4 λ_n c_2) = (4(λ_1 + λ_n) − (2λ_n − 2λ_1)) / (2λ_n (3λ_1 + λ_n)) = 1/λ_n.

This completes the proof.
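The limit algebra used at the end of the proof is easy to sanity-check numerically; the snippet below is our own check (not part of the paper), with c_1 and c_2 as in Lemma 1:

    import numpy as np

    # Verify (2 - 2(c_1 - c_2)) / (4 lambda_n c_2) = 1 / lambda_n for a few eigenvalue pairs.
    for lam1, lamn in [(1.0, 10.0), (0.5, 1e3), (2.0, 1e6)]:
        c1 = (3 * lamn + lam1) / (4 * (lam1 + lamn))
        c2 = (3 * lam1 + lamn) / (4 * (lam1 + lamn))
        assert np.isclose((2 - 2 * (c1 - c2)) / (4 * lamn * c2), 1.0 / lamn)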

Using the same argument as in the proof of Theorem 1, we obtain the following result.

Theorem 2 Under the conditions in Lemma 1, let {g_k} be the sequence generated by applying the method (8) to problem (1). Then we have

    lim_{k→∞} ˆα_k = 1/λ_1,

where ˆα_k = ˆd_k^T ˆd_k / (ˆd_k^T A ˆd_k) with ˆd_k = g_{k−1}/‖g_{k−1}‖ + g_k/‖g_k‖.

2.2 The algorithm

Theorems 1 and 2 in the former subsection provide us the possibility of employing the two stepsizes ˆα_k and ᾱ_k to eliminate the components μ_1^k and μ_n^k. However, the following example shows negative aspects of using ˆα_k. Particularly, we applied the method (8) to a problem of the form (1) with

    A = diag{a_1, a_2, ..., a_n},   b = 0,    (15)

where a_1 = 1, a_n = n and a_i is randomly generated in (1, n), i = 2, ..., n−1. Figure 1 presents the result of an instance with n = 1000.

Fig. 1 Problem (15): convergence history of the sequences {ˆα_k} and {ᾱ_k} for the first 100 iterations of the method (8).

We can see that ᾱ_k approximates 1/λ_n with satisfactory accuracy in a few iterations. Although we did observe that after 1000 iterations the value of |ˆα_k − 1/λ_1| is reduced by a factor of 0.01, ˆα_k converges to 1/λ_1 very slowly in the first few hundreds of iterations.

Now we give a rough explanation of the above phenomenon. Since the gradient norm decreases very slowly, by the update rule (2) we have that

    g_k^{(i)}/‖g_k‖ = (1 − α_{k−1} λ_i) (g_{k−1}^{(i)}/‖g_{k−1}‖) (‖g_{k−1}‖/‖g_k‖) ≈ (1 − α_{k−1} λ_i) g_{k−1}^{(i)}/‖g_{k−1}‖.

Suppose that α_k satisfies |1 − α_k λ_i| ≤ 1 for all i = 1, 2, ..., n and k ≥ k_0. This will be true since α_k^{AOPT} approximates 2/(λ_1 + λ_n) as the iteration process goes on. Let d_k^{(i)} be the i-th component of d_k, i.e.,

    d_k^{(i)} = g_{k−1}^{(i)}/‖g_{k−1}‖ − g_k^{(i)}/‖g_k‖.

For the stepsize ᾱ_k, k ≥ k_0, we consider the following two cases:

Case 1. |1 − α_k λ_i| ≤ 0.6. By trivial computation we know that after five steps the value of |g_k^{(i)}|/‖g_k‖ is less than 8% of its initial value and keeps small at all subsequent iterations; thus d_k^{(i)} can be neglected.

Case 2. |1 − α_k λ_i| > 0.6.

(i) If 1 − α_k λ_i ≥ 0.9, the value of g_k^{(i)}/‖g_k‖ will not change much and thus d_k^{(i)} may not affect the value of ᾱ_k too much, which indicates that the component can also be neglected.

(ii) If 0.6 < 1 − α_k λ_i < 0.9, i.e., 0.1 < α_k λ_i < 0.4, the value of |g_k^{(i)}|/‖g_k‖ will be small in a few iterations, which implies that d_k^{(i)} is safe to be abandoned.

(iii) If 1 − α_k λ_i < −0.6, the value of g_k^{(i)}/‖g_k‖ changes signs and d_k^{(i)} may increase.

The above analysis shows that ᾱ_k will only be dominated by the components corresponding to those eigenvalues in (iii) of Case 2. Notice that the required inequality in (iii) implies that λ_i > 4(λ_1 + λ_n)/5. If A has many large eigenvalues satisfying this condition, ᾱ_k will be an estimate of the reciprocal of one of those eigenvalues, which is also an approximation of 1/λ_n. If A has only a few such large eigenvalues, ᾱ_k will be dominated by the components corresponding to the first few largest eigenvalues, which also yields a good estimate of 1/λ_n. In other words, ᾱ_k will approximate 1/λ_n with satisfactory accuracy in a few iterations. This coincides with our observation in Figure 1.

For the stepsize ˆα_k, when 1 − α_k λ_i > 0, the value of g_{k−1}^{(i)}/‖g_{k−1}‖ + g_k^{(i)}/‖g_k‖ may increase even if 1 − α_k λ_i ≤ 0.1. That is, most of the components of the gradient corresponding to those eigenvalues less than 1/α_k will affect the value of ˆα_k. Thus, ˆα_k would not be a good approximation of 1/λ_1 until those components become very small. Moreover, a rough estimate of 1/λ_1 may yield a large step which will increase most of the components of the gradient. As a result, it may be impractical to use ˆα_k for eliminating the component μ_1^k.
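The experiment behind Figure 1 (and Figure 2 below) is easy to reproduce. The following sketch is our own illustration under the stated setup of problem (15); the dimension, iteration count, random seed and starting point are assumptions, and this is not the authors' code:

    import numpy as np

    def run_problem_15(n=1000, iters=100, seed=0):
        # Problem (15): A = diag{a_1,...,a_n}, b = 0, a_1 = 1, a_n = n, a_i random in (1, n).
        rng = np.random.default_rng(seed)
        a = np.sort(rng.uniform(1.0, n, size=n))
        a[0], a[-1] = 1.0, float(n)
        x = rng.standard_normal(n)          # starting point with nonzero first/last components
        g = a * x                           # gradient of (1) with b = 0 and diagonal A
        for k in range(iters):
            alpha = np.linalg.norm(g) / np.linalg.norm(a * g)   # stepsize (8)
            x = x - alpha * g
            g_prev, g = g, a * x
            d = g_prev / np.linalg.norm(g_prev) - g / np.linalg.norm(g)
            dh = g_prev / np.linalg.norm(g_prev) + g / np.linalg.norm(g)
            bar = d.dot(d) / d.dot(a * d)       # expected to approach 1/lambda_n quickly
            hat = dh.dot(dh) / dh.dot(a * dh)   # approaches 1/lambda_1 only slowly
            print(k, abs(bar - 1.0 / a[-1]), abs(hat - 1.0 / a[0]))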

Based on the above observations, our method only combines ᾱ_k with α_k^{AOPT}. Note that, for quadratics, the BB1 stepsize α_k^{BB1} is the former SD stepsize (3), i.e., α_k^{BB1} = α_{k−1}^{SD}. Motivated by this and the good performance of the BB method, the stepsize ᾱ_k with one step delay is employed for the proposed method. As ᾱ_{k−1} is expected to be short in order to approximate 1/λ_n, we resort to α_k^{AOPT} if ᾱ_{k−1} becomes large. More precisely, our method takes h steps with α_k^{AOPT} and then s short steps, that is,

    α_k = α_k^{AOPT},  if mod(k, h + s) < h;   min{α_k^{AOPT}, ᾱ_{k−1}},  otherwise.    (16)

Remark 1 Although our method (16) looks like the SDC method (7), they differ in the following ways: (i) the method (16) does not use exact line searches, which are necessary for the SDC method; (ii) the method (16) does not reuse any stepsize, while the SDC method uses the same Yuan stepsize α_t^Y for s steps; (iii) the method (16) is monotone, while the SDC method is nonmonotone and its monotone version is obtained by using a safeguard with 2α_k^{SD}.
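A compact implementation of the monotone method (16) could look as follows; this is our own sketch (parameter defaults and names are assumptions), not the authors' Matlab code. The nonmonotone variant (17) introduced below is obtained by replacing α_k^{AOPT} with its value from the previous iteration.

    import numpy as np

    def method_16(A, b, x, h=10, s=80, tol=1e-9, max_iter=20000):
        # Monotone method (16): h steps with alpha_k^{AOPT}, then s steps with
        # min{alpha_k^{AOPT}, bar_alpha_{k-1}}.
        g = A.dot(x) - b
        g0 = np.linalg.norm(g)
        g_prev, bar_km1 = None, np.inf       # bar_alpha_{k-1} is unavailable at the start
        for k in range(max_iter):
            gnorm = np.linalg.norm(g)
            if gnorm <= tol * g0:
                break
            if g_prev is not None:           # bar_alpha_k from g_{k-1} and g_k, see (9)-(10)
                d = g_prev / np.linalg.norm(g_prev) - g / gnorm
                bar_k = d.dot(d) / d.dot(A.dot(d))
            else:
                bar_k = np.inf
            aopt = gnorm / np.linalg.norm(A.dot(g))
            alpha = aopt if k % (h + s) < h else min(aopt, bar_km1)
            g_prev, bar_km1 = g, bar_k       # shift the one-step history
            x = x - alpha * g
            g = A.dot(x) - b
        return x, k

Keeping only the previous gradient and the previously computed ᾱ value realizes the one-step delay in (16) without any extra matrix-vector products.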

Fig. 2 Problem (15): history of the sequence {ᾱ_k} for the first 100 iterations of the gradient method with stepsize α_{k−1}^{AOPT}.

The analysis at the beginning of this subsection indicates that d_k^{(i)} will be small if |1 − α_k λ_i| is small. Thus, any stepsize that approximates some 1/λ_i is expected to enforce a small ᾱ_k. It is natural to consider the retarded stepsize α_{k−1}^{AOPT}, which has been analyzed in [8]. We can see from Figure 2 that, when the gradient method with α_{k−1}^{AOPT} is applied to problem (15), the stepsize ᾱ_k approximates 1/λ_n with high accuracy if its value is small. Motivated by this observation, we suggest the following nonmonotone variant:

    α_k = α_{k−1}^{AOPT},  if mod(k, h + s) < h;   min{α_{k−1}^{AOPT}, ᾱ_{k−1}},  otherwise.    (17)

3 Convergence

In this section, we establish the R-linear convergence of our method (16) and its nonmonotone variant (17). Since the gradient method (2) is invariant under translations and rotations when applied to problem (1), we make the following assumption throughout the analysis.

Assumption 1. The matrix A is diagonal, i.e.,

    A = diag{λ_1, λ_2, ..., λ_n},    (18)

with 0 < λ_1 < λ_2 < ... < λ_n.

In order to give a uniform analysis of the methods (16) and (17), we resort to the following property proposed by Dai [7].

Property (A) [7]. Suppose that there exist an integer m and positive constants M_1 ≥ λ_1 and M_2 such that
(i) λ_1 ≤ α_k^{−1} ≤ M_1;
(ii) for any integer l ∈ [1, n−1] and ε > 0, if G(k−j, l) ≤ ε and (g_{k−j}^{(l+1)})^2 ≥ M_2 ε hold for j ∈ [0, min{k, m} − 1], then α_k^{−1} ≥ (2/3) λ_{l+1}.
Here, G(k, l) = Σ_{i=1}^l (g_k^{(i)})^2.

Dai [7] has proved that if A has the form (18) with λ_1 = 1 and the stepsizes of the gradient method (2) have Property (A), then either g_k = 0 for some finite k or the sequence {‖g_k‖} converges to zero R-linearly. Therefore, we show R-linear convergence of the proposed methods by proving that (16) and (17) satisfy Property (A).

Theorem 3 Suppose that the sequence {‖g_k‖} is generated by either of the methods (16) and (17) applied to an n-dimensional quadratic with the matrix A of the form (18) and 1 = λ_1 < λ_2 < ... < λ_n. Then either g_k = 0 for some finite k or the sequence {‖g_k‖} converges to zero R-linearly.

Proof We show that the stepsize α_k has Property (A) with m = 2, M_1 = λ_n and M_2 = 2. Clearly, λ_1 ≤ α_k^{−1} ≤ λ_n for all k ≥ 1. Thus, (i) of Property (A) holds with M_1 = λ_n.

Notice that

    (α_k^{AOPT})^{−1} ≥ (α_k^{SD})^{−1}   and   (α_{k−1}^{AOPT})^{−1} ≥ (α_{k−1}^{SD})^{−1} = (α_k^{BB1})^{−1}.

Thus it suffices to show that (ii) holds for α_k^{BB1}. If G(k−j, l) ≤ ε and (g_{k−j}^{(l+1)})^2 ≥ 2ε hold for j ∈ [0, min{k, m} − 1], we have

    (α_{k−j}^{BB1})^{−1} = (α_{k−j−1}^{SD})^{−1} = Σ_{i=1}^n λ_i (g_{k−j−1}^{(i)})^2 / Σ_{i=1}^n (g_{k−j−1}^{(i)})^2
        ≥ λ_{l+1} Σ_{i=l+1}^n (g_{k−j−1}^{(i)})^2 / ( G(k−j−1, l) + Σ_{i=l+1}^n (g_{k−j−1}^{(i)})^2 )
        ≥ λ_{l+1} · 2ε / (ε + 2ε) = (2/3) λ_{l+1}.

This completes the proof.

4 Numerical results

In this section, we compare the performance of our methods (16) and (17) with the DY method (6) in [14], with the ABBmin2 method in [21], and with the SDC method (7) in [15]. Since the SDC method performs better than its monotone counterpart for most problems, we only run SDC. All codes were written in Matlab (v.9.0-R2016a). All the runs were carried out on a laptop with an Intel Core i7, 2.9 GHz processor and 8 GB of RAM running a Windows system.

For all the test problems, the iteration was stopped if

    ‖g_k‖ ≤ ε ‖g_1‖    (19)

is satisfied with a given ε, or if the iteration number exceeds 20,000. Based on the observations from Figures 1 and 2, we tested h with the values 10 and 20 for our methods. As in [21], the parameter τ used by the ABBmin2 method is set to 0.9 for all the problems.

Firstly, the compared methods were applied to 1000-dimensional quadratic problems with different spectral distributions. In particular, the problem used in [10,13,22,34] is employed, where A = QVQ^T with

    Q = (I − 2w_3 w_3^T)(I − 2w_2 w_2^T)(I − 2w_1 w_1^T),

and w_1, w_2, and w_3 are unitary random vectors, V = diag(v_1, ..., v_n) is a diagonal matrix where v_1 = 1 and v_n = κ, and v_j is randomly generated between 1 and κ for j = 2, ..., n−1. The entries of b were randomly generated between −10 and 10. Five sets of different spectral distributions, presented in Table 1, were carried out. For each problem set, three different values of the condition number κ as well as three accuracies ε were tested. For each value of κ or ε, 10 instances were randomly generated. Table 2 presents the number of iterations averaged over those instances using the starting point x_1 = (1, ..., 1)^T, where the parameter pair (h, s) used for the SDC method was set to (8, 6), which is more efficient than other choices in our test.

Table 1 Distributions of v_j.

Problem 1: {v_2, ..., v_{n−1}} ⊂ (1, κ)
Problem 2: {v_2, ..., v_{n/5}} ⊂ (1, 100); {v_{n/5+1}, ..., v_{n−1}} ⊂ (κ/2, κ)
Problem 3: {v_2, ..., v_{n/2}} ⊂ (1, 100); {v_{n/2+1}, ..., v_{n−1}} ⊂ (κ/2, κ)
Problem 4: {v_2, ..., v_{4n/5}} ⊂ (1, 100); {v_{4n/5+1}, ..., v_{n−1}} ⊂ (κ/2, κ)
Problem 5: {v_2, ..., v_{n/5}} ⊂ (1, 100); {v_{n/5+1}, ..., v_{4n/5}} ⊂ (100, κ/2); {v_{4n/5+1}, ..., v_{n−1}} ⊂ (κ/2, κ)
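The random test problems just described are straightforward to generate. The sketch below is our own illustration of the construction A = QVQ^T with the spectrum of the first problem set in Table 1; the function names, seed handling and the restriction to problem set 1 are assumptions of this transcription:

    import numpy as np

    def householder_q(n, rng):
        # Q = (I - 2 w3 w3^T)(I - 2 w2 w2^T)(I - 2 w1 w1^T) with unit random vectors w_i.
        Q = np.eye(n)
        for _ in range(3):
            w = rng.standard_normal(n)
            w /= np.linalg.norm(w)
            Q -= 2.0 * np.outer(w, Q.T @ w)   # Q <- (I - 2 w w^T) Q
        return Q

    def make_problem(n, kappa, rng):
        # Problem set 1 in Table 1: v_1 = 1, v_n = kappa, v_j random in (1, kappa).
        v = np.empty(n)
        v[0], v[-1] = 1.0, kappa
        v[1:-1] = rng.uniform(1.0, kappa, size=n - 2)
        Q = householder_q(n, rng)
        A = Q @ np.diag(v) @ Q.T
        b = rng.uniform(-10.0, 10.0, size=n)  # entries of b in (-10, 10)
        return A, b

For instance, A, b = make_problem(1000, 1e4, np.random.default_rng(0)) generates one instance; the other problem sets in Table 1 only change how v_2, ..., v_{n−1} are drawn.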

We can see from Table 2 that our method (16) is very competitive with the DY, ABBmin2 and SDC methods. For a fixed h, larger values of s seem to be preferable for our method (16). In addition, different settings of s lead to comparable results, with differences of less than 10% in the number of iterations for most of the test problems. For the first problem set, our method (16) outperforms the DY and SDC methods, although the ABBmin2 method is the fastest one among the compared methods. Particularly, when a high accuracy is required, the method (16) with (h, s) = (20, 100) takes less than 1/6 and 1/4 of the iterations needed by the DY and SDC methods, respectively. As for the second to fourth problem sets, the method (16) with different settings performs better than the DY and ABBmin2 methods. Moreover, it is comparable to and even better than the SDC method if proper h and s are selected. Our method (16) also shows advantages over the DY and SDC methods on the last problem set. The total number of iterations indicates the effectiveness of our method (16) as well.

We have to point out that our method (16) and the DY method are monotone while the ABBmin2 and SDC methods are not. Moreover, the per-iteration cost of the ABBmin2 method is higher because, in addition to one matrix-vector product, it needs four vector-vector multiplications to compute its stepsize at each iteration, while our method (16) only takes three.

Table 2 Number of averaged iterations of the methods (16), DY, ABBmin2 and SDC on problems in Table 1, for each problem set and accuracy ε. Columns for the method (16) correspond to (h, s) = (10, 20), (10, 30), (10, 50), (10, 80), (10, 100), (20, 20), (20, 30), (20, 50), (20, 80), (20, 100).

Table 3 presents the averaged number of iterations of our method (17) on the problems in Table 1. For comparison purposes, the results of the DY, ABBmin2 and SDC methods are restated there. Similar performance as for the method (16) shown in Table 2 can be observed. In particular, for each accuracy level, our method (17) takes around 30% fewer total iterations than the ABBmin2 method and even fewer than those required by the DY and SDC methods. Notice that the problems of the last set are difficult for the compared methods since they require more iterations than the other four sets. However, our method (17) always dominates the three compared methods except with the pair (h, s) = (10, 20). As compared with the method (16), the retard strategy used in the method (17) tends to improve the performance when h = 10. For the case h = 20, the method (17) is comparable to and even better than (16) in terms of total iterations.

Table 3 Number of averaged iterations of the methods (17), DY, ABBmin2 and SDC on problems in Table 1, for each problem set and accuracy ε. Columns for the method (17) correspond to (h, s) = (10, 20), (10, 30), (10, 50), (10, 80), (10, 100), (20, 20), (20, 30), (20, 50), (20, 80), (20, 100).

Secondly, we compared the methods on two large-scale real problems, Laplace1(a) and Laplace1(b), described in [19]. Both problems require the solution of an elliptic system of linear equations arising from a 3D Laplacian on a box, discretized using a standard 7-point finite difference stencil. The solution is fixed by a Gaussian function whose center is (α, β, γ), multiplied by x(x−1). A parameter σ is used to control the rate of decay of the Gaussian. Both Laplace1(a) and Laplace1(b) have n = N^3 variables, with N being the number of interior nodes taken in each coordinate direction, and a highly sparse Hessian matrix with a large condition number. We refer the readers to [19] for more details on these problems. In our tests, the associated parameters are set as follows:

(a) σ = 20, α = β = γ = 0.5; (b) σ = 50, α = 0.4, β = 0.7, γ = 0.5.

We used the null vector as the initial point. The numbers of iterations required by the compared methods on the two problems Laplace1(a) and Laplace1(b) are listed in Tables 4 and 5. From Table 4 we can see that, for the problem Laplace1(a), our method (16) with a large s outperforms the DY and SDC methods when the accuracy level is high. Moreover, with a proper (h, s), it performs better than the ABBmin2 method, which dominates the DY and SDC methods. For the problem Laplace1(b), our method (16) is still competitive with the compared methods.

Table 5 shows a similar tendency to the former one. However, for the problem Laplace1(b), our method (17) clearly outperforms the DY and SDC methods, especially when the required accuracy is high. Furthermore, our method (17) is very competitive with the ABBmin2 method in terms of total iterations.

Table 4 Number of iterations of the methods (16), DY, ABBmin2 and SDC on the 3D Laplacian problems Laplace1(a) and Laplace1(b), for each n and accuracy ε. Columns for the method (16) correspond to (h, s) = (10, 20), (10, 30), (10, 50), (10, 80), (10, 100), (20, 20), (20, 30), (20, 50), (20, 80), (20, 100).

5 Discussion

Based on the asymptotic optimal stepsize, we have proposed a new monotone gradient method which employs a new stepsize that converges to the reciprocal of the largest eigenvalue of the Hessian of the objective function. A nonmonotone variant has been presented as well. R-linear convergence of the proposed methods has been established. Our numerical experiments showed that the proposed methods are very competitive with other recent successful gradient methods.

As mentioned in Section 2, any stepsize that approximates some 1/λ_i is expected to enforce a small ᾱ_k. Thus, many stepsizes can be selected, such as α_k^{BB1} and α_k^{BB2}. We trivially combined the method (16) with the two BB stepsizes, that is,

    α_k = α_k^{BB1},  if mod(k, h + s) < h;   min{α_k^{BB1}, ᾱ_{k−1}},  otherwise,    (20)

and

    α_k = α_k^{BB2},  if mod(k, h + s) < h;   min{α_k^{BB2}, ᾱ_{k−1}},  otherwise.    (21)

Promising numerical results of the methods (20) and (21) have been observed. More sophisticated strategies that exploit the favorable properties of the BB method and our new stepsize ᾱ_k can be expected. An interesting question is how to approximate 1/λ_1 so as to eliminate the corresponding components. This will be challenging because it needs a large stepsize, which may increase those components related to small eigenvalues and deteriorate the performance. Another future concern is to apply the proposed methods to general unconstrained optimization and to constrained optimization.
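As a brief illustration of the stepsize rules (20) and (21), reusing the BB formulas (4), one might write the selection as follows (our own sketch; names are ours):

    def bb_variant_stepsize(k, h, s, a_bb, bar_prev):
        # Stepsize rule of (20)/(21): a_bb is alpha_k^{BB1} or alpha_k^{BB2} from (4),
        # bar_prev is bar_alpha_{k-1} from (9)-(10).
        if k % (h + s) < h:
            return a_bb
        return min(a_bb, bar_prev)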

Table 5 Number of iterations of the methods (17), DY, ABBmin2 and SDC on the 3D Laplacian problems Laplace1(a) and Laplace1(b), for each n and accuracy ε. Columns for the method (17) correspond to (h, s) = (10, 20), (10, 30), (10, 50), (10, 80), (10, 100), (20, 20), (20, 30), (20, 50), (20, 80), (20, 100).

References

1. Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1-16 (1959)
2. Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1) (1988)
3. Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4) (2000)
4. Birgin, E.G., Martínez, J.M., Raydan, M., et al.: Spectral projected gradient methods: review and perspectives. J. Stat. Softw. 60(3) (2014)
5. Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25 (1847)
6. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3) (1995)
7. Dai, Y.H.: Alternate step gradient method. Optimization 52(4-5) (2003)
8. Dai, Y.H., Al-Baali, M., Yang, X.: A positive Barzilai-Borwein-like stepsize and an extension for symmetric linear systems. In: Numerical Analysis and Optimization. Springer (2015)
9. Dai, Y.H., Fletcher, R.: Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1) (2005)
10. Dai, Y.H., Huang, Y., Liu, X.W.: A family of spectral gradient methods for optimization. Available at: HTML/2018/05/6628.html (2018)
11. Dai, Y.H., Liao, L.Z.: R-linear convergence of the Barzilai and Borwein gradient method. IMA J. Numer. Anal. 22(1), 1-10 (2002)

12. Dai, Y.H., Yang, X.: A new gradient method with an optimal stepsize property. Comp. Optim. Appl. 33(1), 73-88 (2006)
13. Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3) (2003)
14. Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
15. De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comp. Optim. Appl. 59(3) (2014)
16. Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318 (2018)
17. Elman, H.C., Golub, G.H.: Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal. 31(6) (1994)
18. Figueiredo, M.A., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process. 1(4) (2007)
19. Fletcher, R.: On the Barzilai-Borwein method. In: Optimization and Control with Applications (2005)
20. Forsythe, G.E.: On the asymptotic directions of the s-dimensional optimum gradient method. Numer. Math. 11(1) (1968)
21. Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299 (2008)
22. Friedlander, A., Martínez, J.M., Molina, B., Raydan, M.: Gradient method with retards and generalizations. SIAM J. Numer. Anal. 36(1) (1998)
23. Gonzaga, C.C., Schneider, R.M.: On the steepest descent algorithm for quadratic functions. Comp. Optim. Appl. 63(2) (2016)
24. Huang, Y., Liu, H.: Smoothing projected Barzilai-Borwein method for constrained non-Lipschitz optimization. Comp. Optim. Appl. 65(3) (2016)
25. Huang, Y., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai-Borwein method for nonnegative matrix factorization. Data Min. Knowl. Disc. 29(6) (2015)
26. Jiang, B., Dai, Y.H.: Feasible Barzilai-Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Method Softw. 28(4) (2013)
27. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788 (1999)
28. Liu, Y.F., Dai, Y.H., Luo, Z.Q.: Coordinated beamforming for MISO interference channel: Complexity analysis and efficient algorithms. IEEE Trans. Signal Process. 59(3) (2011)
29. Nocedal, J., Sartenaer, A., Zhu, C.: On the behavior of the gradient norm in the steepest descent method. Comp. Optim. Appl. 22(1), 5-35 (2002)
30. Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3) (1993)
31. Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1) (1997)
32. Yuan, Y.X.: A new stepsize for the steepest descent method. J. Comput. Math. (2006)
33. Yuan, Y.X.: Step-sizes for the gradient method. AMS/IP Studies in Advanced Mathematics 42(2) (2008)
34. Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comp. Optim. Appl. 35(1) (2006)


More information

Nonlinear conjugate gradient methods, Unconstrained optimization, Nonlinear

Nonlinear conjugate gradient methods, Unconstrained optimization, Nonlinear A SURVEY OF NONLINEAR CONJUGATE GRADIENT METHODS WILLIAM W. HAGER AND HONGCHAO ZHANG Abstract. This paper reviews the development of different versions of nonlinear conjugate gradient methods, with special

More information

Interior-Point Methods as Inexact Newton Methods. Silvia Bonettini Università di Modena e Reggio Emilia Italy

Interior-Point Methods as Inexact Newton Methods. Silvia Bonettini Università di Modena e Reggio Emilia Italy InteriorPoint Methods as Inexact Newton Methods Silvia Bonettini Università di Modena e Reggio Emilia Italy Valeria Ruggiero Università di Ferrara Emanuele Galligani Università di Modena e Reggio Emilia

More information

ABSTRACT 1. INTRODUCTION

ABSTRACT 1. INTRODUCTION A DIAGONAL-AUGMENTED QUASI-NEWTON METHOD WITH APPLICATION TO FACTORIZATION MACHINES Aryan Mohtari and Amir Ingber Department of Electrical and Systems Engineering, University of Pennsylvania, PA, USA Big-data

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf. Maria Cameron 1. Trust Region Methods At every iteration the trust region methods generate a model m k (p), choose a trust region, and solve the constraint optimization problem of finding the minimum of

More information

A Smoothing Newton Method for Solving Absolute Value Equations

A Smoothing Newton Method for Solving Absolute Value Equations A Smoothing Newton Method for Solving Absolute Value Equations Xiaoqin Jiang Department of public basic, Wuhan Yangtze Business University, Wuhan 430065, P.R. China 392875220@qq.com Abstract: In this paper,

More information

On the convergence properties of the modified Polak Ribiére Polyak method with the standard Armijo line search

On the convergence properties of the modified Polak Ribiére Polyak method with the standard Armijo line search ANZIAM J. 55 (E) pp.e79 E89, 2014 E79 On the convergence properties of the modified Polak Ribiére Polyak method with the standard Armijo line search Lijun Li 1 Weijun Zhou 2 (Received 21 May 2013; revised

More information

TMA4180 Solutions to recommended exercises in Chapter 3 of N&W

TMA4180 Solutions to recommended exercises in Chapter 3 of N&W TMA480 Solutions to recommended exercises in Chapter 3 of N&W Exercise 3. The steepest descent and Newtons method with the bactracing algorithm is implemented in rosenbroc_newton.m. With initial point

More information

Modifications of Steepest Descent Method and Conjugate Gradient Method Against Noise for Ill-posed Linear Systems

Modifications of Steepest Descent Method and Conjugate Gradient Method Against Noise for Ill-posed Linear Systems Available online at www.ispacs.com/cna Volume 2012, Year 2012 Article ID cna-00115, 24 pages doi: 10.5899/2012/cna-00115 Research Article Modifications of Steepest Descent Method and Conjugate Gradient

More information

MS&E 318 (CME 338) Large-Scale Numerical Optimization

MS&E 318 (CME 338) Large-Scale Numerical Optimization Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods

More information

Sparse Optimization Lecture: Dual Methods, Part I

Sparse Optimization Lecture: Dual Methods, Part I Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration

More information

Adaptive First-Order Methods for General Sparse Inverse Covariance Selection

Adaptive First-Order Methods for General Sparse Inverse Covariance Selection Adaptive First-Order Methods for General Sparse Inverse Covariance Selection Zhaosong Lu December 2, 2008 Abstract In this paper, we consider estimating sparse inverse covariance of a Gaussian graphical

More information

Multilevel Optimization Methods: Convergence and Problem Structure

Multilevel Optimization Methods: Convergence and Problem Structure Multilevel Optimization Methods: Convergence and Problem Structure Chin Pang Ho, Panos Parpas Department of Computing, Imperial College London, United Kingdom October 8, 206 Abstract Building upon multigrid

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić February 7, 2017 Abstract We consider distributed optimization

More information

The semi-convergence of GSI method for singular saddle point problems

The semi-convergence of GSI method for singular saddle point problems Bull. Math. Soc. Sci. Math. Roumanie Tome 57(05 No., 04, 93 00 The semi-convergence of GSI method for singular saddle point problems by Shu-Xin Miao Abstract Recently, Miao Wang considered the GSI method

More information

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH Jie Sun 1 Department of Decision Sciences National University of Singapore, Republic of Singapore Jiapu Zhang 2 Department of Mathematics

More information

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained

More information

Steepest descent method implementation on unconstrained optimization problem using C++ program

Steepest descent method implementation on unconstrained optimization problem using C++ program IOP Conerence Series: Materials Science and Engineering PAPER OPEN ACCESS Steepest descent method implementation on unconstrained optimization problem using C++ program o cite this article: H Napitupulu

More information

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Coralia Cartis, University of Oxford INFOMM CDT: Modelling, Analysis and Computation of Continuous Real-World Problems Methods

More information

Lecture 3: Linesearch methods (continued). Steepest descent methods

Lecture 3: Linesearch methods (continued). Steepest descent methods Lecture 3: Linesearch methods (continued). Steepest descent methods Coralia Cartis, Mathematical Institute, University of Oxford C6.2/B2: Continuous Optimization Lecture 3: Linesearch methods (continued).

More information

THE solution of the absolute value equation (AVE) of

THE solution of the absolute value equation (AVE) of The nonlinear HSS-like iterative method for absolute value equations Mu-Zheng Zhu Member, IAENG, and Ya-E Qi arxiv:1403.7013v4 [math.na] 2 Jan 2018 Abstract Salkuyeh proposed the Picard-HSS iteration method

More information