Optimization Methods and Software (Taylor & Francis). Publication details, including instructions for authors and subscription information: http://www.informaworld.com/smpp/title~content=t713645924. First published on: 11 October 2006. To link to this article: DOI: 10.1080/10556780600822260; URL: http://dx.doi.org/10.1080/10556780600822260

Scaled memoryless BFGS preconditioned conjugate gradient algorithm for unconstrained optimization

NECULAI ANDREI*
Research Institute for Informatics, Center for Advanced Modeling and Optimization, 8-10, Averescu Avenue, Bucharest 1, Romania
*Corresponding author. Email: nandrei@ici.ro

(Received 18 March 2005; revised 8 September 2005; in final form 23 May 2006)

A scaled memoryless BFGS preconditioned conjugate gradient algorithm for solving unconstrained optimization problems is presented. The basic idea is to combine the scaled memoryless BFGS method and the preconditioning technique in the frame of the conjugate gradient method. The preconditioner, which is also a scaled memoryless BFGS matrix, is reset when the Beale-Powell restart criterion holds. The parameter scaling the gradient is selected as the spectral gradient. Under very mild conditions, it is shown that, for strongly convex functions, the algorithm is globally convergent. Computational results for a set consisting of 750 unconstrained optimization test problems show that this new scaled conjugate gradient algorithm substantially outperforms the known conjugate gradient methods, including the spectral conjugate gradient by Birgin and Martínez [Birgin, E. and Martínez, J.M., 2001, A spectral conjugate gradient method for unconstrained optimization. Applied Mathematics and Optimization, 43, 117-128], the conjugate gradient by Polak and Ribière [Polak, E. and Ribière, G., 1969, Note sur la convergence de méthodes de directions conjuguées. Revue Française Informat. Recherche Opérationnelle, 16, 35-43], as well as the most recent conjugate gradient method with guaranteed descent by Hager and Zhang [Hager, W.W. and Zhang, H., 2005, A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM Journal on Optimization, 16, 170-192; Hager, W.W. and Zhang, H., 2006, Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software, 32, 113-137].

Keywords: Unconstrained optimization; Conjugate gradient method; Spectral gradient method; Wolfe line search; BFGS preconditioning

2000 Mathematics Subject Classification: 49M07; 49M10; 90C06; 65K

1. Introduction

In this paper, the following unconstrained optimization problem is considered:

    \min f(x),                                                                    (1)

where $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and its gradient is available. Elaborating an algorithm is of interest for solving large-scale cases for which the Hessian of $f$ is either not available or requires a large amount of storage and computational costs.
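To make the setting concrete, the following minimal sketch (not from the paper; the quadratic test function, its dimension and the names quad_f and quad_grad are illustrative assumptions) sets up a problem of form (1) with an analytically available gradient, of the kind the later sketches operate on:

import numpy as np

# Illustrative strongly convex quadratic: f(x) = 0.5 x'Ax - b'x with
# A = diag(1, ..., 10) and b = (1, ..., 1)', so the gradient g(x) = Ax - b
# is available in closed form, as assumed in (1).
n = 1000
diag = np.linspace(1.0, 10.0, n)

def quad_f(x):
    return 0.5 * np.dot(x, diag * x) - np.sum(x)

def quad_grad(x):
    return diag * x - np.ones(n)

Such a function is strongly convex with modulus 1 and its gradient is Lipschitz with constant 10, which matches the kind of assumptions used later in section 4.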

The paper presents a conjugate gradient algorithm based on a combination of the scaled memoryless BFGS method and the preconditioning technique. For general non-linear functions, a good preconditioner is any matrix that approximates $\nabla^2 f(x^*)^{-1}$, where $x^*$ is the solution of (1). In this algorithm, the preconditioner is a scaled memoryless BFGS matrix which is reset when the Powell restart criterion holds. The scaling factor in the preconditioner is selected as the spectral gradient. The algorithm uses the conjugate gradient direction in which the famous parameter $\beta_k$ is obtained by equating the conjugate gradient direction with the direction corresponding to the Newton method. Thus, a general formula for the direction computation is obtained, which can be particularized to include the Polak and Ribière [1] and the Fletcher and Reeves [2] conjugate gradient algorithms, the spectral conjugate gradient (SCG) by Birgin and Martínez [3] or the algorithm of Dai and Liao [4], for $t = 1$. This direction is then modified in a canonical manner, as considered earlier by Oren and Luenberger [5], Oren and Spedicato [6], Perry [7] and Shanno [8,9], by means of a scaled, memoryless BFGS preconditioner placed in the Beale-Powell restart technology. The scaling factor is computed in a spectral manner based on the inverse Rayleigh quotient, as suggested by Raydan [10]. The method is an extension of the SCG by Birgin and Martínez [3], or of a variant of the conjugate gradient algorithm by Dai and Liao [4] (for $t = 1$), to overcome the lack of positive definiteness of the matrix defining their search direction.

The paper is organized as follows. In section 2, the method is presented. Section 3 is dedicated to the scaled conjugate gradient (SCALCG) algorithm. The algorithm performs two types of steps: a standard one, in which a double quasi-Newton updating scheme is used, and a restart one, where the current information is used to define the search direction. The convergence of the algorithm for strongly convex functions is proved in section 4. Finally, in section 5, the computational results on a set of 750 unconstrained optimization problems from the CUTE [11] collection, along with some other large-scale unconstrained optimization problems, are presented, and the performance of the new algorithm is compared with the SCG conjugate gradient method by Birgin and Martínez [3], the Polak and Ribière conjugate gradient algorithm [1], as well as the recent conjugate gradient method with guaranteed descent (CG_DESCENT) by Hager and Zhang [12,13].

2. Method

The algorithm generates a sequence $x_k$ of approximations to the minimum $x^*$ of $f$, in which

    x_{k+1} = x_k + \alpha_k d_k,                                                 (2)

    d_{k+1} = -\theta_{k+1} g_{k+1} + \beta_k s_k,                                (3)

where $g_k = \nabla f(x_k)$, $\alpha_k$ is selected to minimize $f(x)$ along the search direction $d_k$, $\beta_k$ is a scalar parameter, $s_k = x_{k+1} - x_k$ and $\theta_{k+1}$ is a parameter to be determined. The iterative process is initialized with an initial point $x_0$ and $d_0 = -g_0$.

Observe that if $\theta_{k+1} = 1$, then the classical conjugate gradient algorithms are obtained according to the value of the scalar parameter $\beta_k$. In contrast, if $\beta_k = 0$, then another class of algorithms is obtained according to the selection of the parameter $\theta_{k+1}$. Considering $\beta_k = 0$, there are two possibilities for $\theta_{k+1}$: a positive scalar or a positive definite matrix. If $\theta_{k+1} = 1$, then the steepest descent algorithm results. If $\theta_{k+1} = \nabla^2 f(x_{k+1})^{-1}$, or an approximation of it, then the Newton or the quasi-Newton algorithms are obtained, respectively. Therefore, it is seen that in the general case, when $\theta_{k+1} \ne 0$ is selected in a quasi-Newton manner and $\beta_k \ne 0$, (3) represents a combination between the quasi-Newton and the conjugate gradient methods.

However, if $\theta_{k+1}$ is a matrix containing some useful information about the inverse Hessian of the function $f$, it is better off using $d_{k+1} = -\theta_{k+1} g_{k+1}$, because the addition of the term $\beta_k s_k$ in (3) may prevent the direction $d_{k+1}$ from being a descent direction unless the line search is sufficiently accurate. Therefore, in this paper, $\theta_{k+1}$ will be considered as a positive scalar which contains some useful information about the inverse Hessian of the function $f$.

To determine $\beta_k$, consider the following procedure. As is known, the Newton direction for solving (1) is given by $d_{k+1} = -\nabla^2 f(x_{k+1})^{-1} g_{k+1}$. Therefore, from the equality

    -\nabla^2 f(x_{k+1})^{-1} g_{k+1} = -\theta_{k+1} g_{k+1} + \beta_k s_k,

after some algebra,

    \beta_k = \frac{s_k^\top \nabla^2 f(x_{k+1}) \theta_{k+1} g_{k+1} - s_k^\top g_{k+1}}{s_k^\top \nabla^2 f(x_{k+1}) s_k}.        (4)

Using the Taylor development (i.e. $\nabla^2 f(x_{k+1}) s_k \cong y_k$),

    \beta_k = \frac{(\theta_{k+1} y_k - s_k)^\top g_{k+1}}{y_k^\top s_k}                                                           (5)

is obtained, where $y_k = g_{k+1} - g_k$. Birgin and Martínez [3] arrived at the same formula for $\beta_k$, but using a geometric interpretation of quadratic function minimization. The direction corresponding to $\beta_k$ given in (5) is as follows:

    d_{k+1} = -\theta_{k+1} g_{k+1} + \frac{(\theta_{k+1} y_k - s_k)^\top g_{k+1}}{y_k^\top s_k} s_k.                              (6)

This direction is used by Birgin and Martínez [3] in their SCG package for unconstrained optimization, where $\theta_{k+1}$ is selected in a spectral manner, as suggested by Raydan [10]. The following particularizations are obvious. If $\theta_{k+1} = 1$, then (6) is the direction considered by Perry [7]. At the same time, it is seen that (6) is the direction given by Dai and Liao [4] for $t = 1$, obtained this time by an interpretation of the conjugacy condition. Additionally, if $s_j^\top g_{j+1} = 0$ for $j = 0, 1, \ldots, k$, then from (6),

    d_{k+1} = -\theta_{k+1} g_{k+1} + \frac{\theta_{k+1} y_k^\top g_{k+1}}{\alpha_k \theta_k g_k^\top g_k} s_k,                    (7)

which is the direction corresponding to a generalization of the Polak and Ribière formula. Of course, if $\theta_{k+1} = \theta_k = 1$ in (7), the classical Polak and Ribière formula [1] is obtained. If $s_j^\top g_{j+1} = 0$ for $j = 0, 1, \ldots, k$, and additionally the successive gradients are orthogonal, then from (6),

    d_{k+1} = -\theta_{k+1} g_{k+1} + \frac{\theta_{k+1} g_{k+1}^\top g_{k+1}}{\alpha_k \theta_k g_k^\top g_k} s_k,                (8)

which is the direction corresponding to a generalization of the Fletcher and Reeves formula [2]. Therefore, (6) is a general formula for direction computation in a conjugate gradient manner, including the classical Fletcher and Reeves [2] and Polak and Ribière [1] formulas.

There is a result by Shanno [8,9] which says that the conjugate gradient method is precisely the BFGS quasi-Newton method for which the initial approximation to the inverse of the Hessian is taken, at every step, as the identity matrix. The extension to the scaled conjugate gradient is very simple.
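As a concrete reading of (5) and (6), the following sketch (a minimal illustration only; the names beta_eq5 and direction_eq6 are not taken from the paper's Fortran code) computes the search direction from $g_{k+1}$, $s_k$, $y_k$ and a given scaling $\theta_{k+1}$; with theta = 1 it gives Perry's direction, which coincides with the Dai-Liao direction for $t = 1$:

import numpy as np

def beta_eq5(g_next, s, y, theta):
    # Equation (5): beta_k = (theta_{k+1} y_k - s_k)' g_{k+1} / (y_k' s_k).
    return np.dot(theta * y - s, g_next) / np.dot(y, s)

def direction_eq6(g_next, s, y, theta):
    # Equation (6): d_{k+1} = -theta_{k+1} g_{k+1} + beta_k s_k.
    return -theta * g_next + beta_eq5(g_next, s, y, theta) * s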

Using the same methodology as considered by Shanno [8], the following direction $d_{k+1}$ is obtained:

    d_{k+1} = -\theta_{k+1} g_{k+1} + \theta_{k+1} \frac{g_{k+1}^\top s_k}{y_k^\top s_k} y_k
              - \left[ \left( 1 + \theta_{k+1} \frac{y_k^\top y_k}{y_k^\top s_k} \right) \frac{g_{k+1}^\top s_k}{y_k^\top s_k}
              - \theta_{k+1} \frac{g_{k+1}^\top y_k}{y_k^\top s_k} \right] s_k,                                                    (9)

involving only four scalar products. Again observe that if $g_{k+1}^\top s_k = 0$, then (9) reduces to

    d_{k+1} = -\theta_{k+1} g_{k+1} + \theta_{k+1} \frac{g_{k+1}^\top y_k}{y_k^\top s_k} s_k.                                      (10)

Thus, in this case, the effect is simply one of multiplying the Hestenes and Stiefel [14] search direction by a positive scalar.

In order to ensure the convergence of algorithm (2), with $d_{k+1}$ given by (9), the choice of $\alpha_k$ should be constrained. Consider line searches that satisfy the Wolfe conditions [15,16]:

    f(x_k + \alpha_k d_k) - f(x_k) \le \sigma_1 \alpha_k g_k^\top d_k,                                                             (11)

    \nabla f(x_k + \alpha_k d_k)^\top d_k \ge \sigma_2 g_k^\top d_k,                                                               (12)

where $0 < \sigma_1 \le \sigma_2 < 1$.

THEOREM 1  Suppose that $\alpha_k$ in (2) satisfies the Wolfe conditions (11) and (12); then the direction $d_{k+1}$ given by (9) is a descent direction.

Proof  Since $d_0 = -g_0$, $g_0^\top d_0 = -\|g_0\|^2 \le 0$. Multiplying (9) by $g_{k+1}^\top$,

    g_{k+1}^\top d_{k+1} = \frac{1}{(y_k^\top s_k)^2} \left[ -\theta_{k+1} \|g_{k+1}\|^2 (y_k^\top s_k)^2
        + 2\theta_{k+1} (g_{k+1}^\top y_k)(g_{k+1}^\top s_k)(y_k^\top s_k)
        - (g_{k+1}^\top s_k)^2 (y_k^\top s_k) - \theta_{k+1} (y_k^\top y_k)(g_{k+1}^\top s_k)^2 \right].

Applying the inequality $u^\top v \le \tfrac{1}{2}(\|u\|^2 + \|v\|^2)$ to the second term on the right-hand side of the above equality, with $u = (s_k^\top y_k) g_{k+1}$ and $v = (g_{k+1}^\top s_k) y_k$,

    g_{k+1}^\top d_{k+1} \le -\frac{(g_{k+1}^\top s_k)^2}{y_k^\top s_k}                                                            (13)

is obtained. But, by the Wolfe condition (12), $y_k^\top s_k > 0$. Therefore, $g_{k+1}^\top d_{k+1} < 0$ for every $k = 0, 1, \ldots$

Observe that the second Wolfe condition (12) is crucial for the descent character of direction (9). Besides, it is seen that estimate (13) is independent of the parameter $\theta_{k+1}$.

Usually, all conjugate gradient algorithms are periodically restarted. The Powell restarting procedure [17,18] is to test if there is very little orthogonality left between the current gradient and the previous one. At step $r$, when

    |g_{r+1}^\top g_r| \ge 0.2 \|g_{r+1}\|^2,                                                                                      (14)

we restart the algorithm using the direction given by (9). At step $r$, $s_r$, $y_r$ and $\theta_{r+1}$ are known. If (14) is satisfied, then a restart step is considered, i.e. the direction is computed as in (9).
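The two tests that drive the algorithm, the Wolfe conditions (11)-(12) and the Powell restart criterion (14), can be written down directly. The sketch below is an illustration only; the helper names and the default values sigma1 = 1e-4 and sigma2 = 0.9 are assumptions, not values prescribed by the paper:

import numpy as np

def wolfe_conditions_hold(f, grad, x, d, alpha, sigma1=1e-4, sigma2=0.9):
    # Wolfe conditions (11)-(12) with 0 < sigma1 <= sigma2 < 1
    # (the concrete sigma values are illustrative defaults).
    g_d = np.dot(grad(x), d)
    sufficient_decrease = f(x + alpha * d) - f(x) <= sigma1 * alpha * g_d
    curvature = np.dot(grad(x + alpha * d), d) >= sigma2 * g_d
    return sufficient_decrease and curvature

def powell_restart(g_next, g):
    # Powell restart criterion (14): restart when the successive gradients
    # are far from orthogonal, |g_{r+1}' g_r| >= 0.2 ||g_{r+1}||^2.
    return abs(np.dot(g_next, g)) >= 0.2 * np.dot(g_next, g_next)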

For $k \ge r + 1$, the same philosophy used by Shanno [8,9] is considered, where the gradient $g_{k+1}$ is modified by a positive definite matrix which best estimates the inverse Hessian without any additional storage requirements, that is,

    v = \theta_{r+1} g_{k+1} - \theta_{r+1} \frac{g_{k+1}^\top s_r}{y_r^\top s_r} y_r
        + \left[ \left( 1 + \theta_{r+1} \frac{y_r^\top y_r}{y_r^\top s_r} \right) \frac{g_{k+1}^\top s_r}{y_r^\top s_r}
        - \theta_{r+1} \frac{g_{k+1}^\top y_r}{y_r^\top s_r} \right] s_r                                                           (15)

and

    w = \theta_{r+1} y_k - \theta_{r+1} \frac{y_k^\top s_r}{y_r^\top s_r} y_r
        + \left[ \left( 1 + \theta_{r+1} \frac{y_r^\top y_r}{y_r^\top s_r} \right) \frac{y_k^\top s_r}{y_r^\top s_r}
        - \theta_{r+1} \frac{y_k^\top y_r}{y_r^\top s_r} \right] s_r,                                                              (16)

involving six scalar products. With these, at any non-restart step, the direction $d_{k+1}$ for $k \ge r + 1$ is computed using a double update scheme as in Shanno [8]:

    d_{k+1} = -v + \frac{(g_{k+1}^\top s_k) w + (g_{k+1}^\top w) s_k}{y_k^\top s_k}
              - \left( 1 + \frac{y_k^\top w}{y_k^\top s_k} \right) \frac{g_{k+1}^\top s_k}{y_k^\top s_k} s_k,                      (17)

involving only four scalar products. Observe that $y_k^\top s_k > 0$ is sufficient to ensure that the direction $d_{k+1}$ given by (17) is well defined, and it is always a descent direction.

Motivated by the efficiency of the spectral gradient method introduced by Raydan [10] and used by Birgin and Martínez [3] in their SCG method for unconstrained optimization, in the algorithm given in this work $\theta_{k+1}$ is defined as a scalar approximation to the inverse Hessian. This is given as the inverse of the Rayleigh quotient

    \frac{s_k^\top \left( \int_0^1 \nabla^2 f(x_k + t s_k)\, dt \right) s_k}{s_k^\top s_k},

i.e.

    \theta_{k+1} = \frac{s_k^\top s_k}{y_k^\top s_k}.                                                                              (18)

The inverse of the Rayleigh quotient lies between the smallest and the largest eigenvalue of the Hessian average $\int_0^1 \nabla^2 f(x_k + t s_k)\, dt$. Again observe that $y_k^\top s_k > 0$ is sufficient to ensure that $\theta_{k+1}$ in (18) is well defined.
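A compact numpy transcription of the direction formulas may help in reading them. The sketch below (illustrative names, not the author's code) evaluates the spectral scaling (18), the restart direction (9) and the standard-step direction (17) built from the vectors $v$ and $w$ of (15)-(16):

import numpy as np

def theta_spectral(s, y):
    # Equation (18): inverse Rayleigh quotient of the average Hessian.
    return np.dot(s, s) / np.dot(y, s)

def restart_direction(g_next, s, y, theta):
    # Equation (9): scaled memoryless BFGS (restart) direction.
    ys = np.dot(y, s)
    gs = np.dot(g_next, s)
    gy = np.dot(g_next, y)
    return (-theta * g_next
            + theta * (gs / ys) * y
            - ((1.0 + theta * np.dot(y, y) / ys) * gs / ys - theta * gy / ys) * s)

def double_update_direction(g_next, s, y, s_r, y_r, theta_r):
    # Equations (15)-(17): standard (non-restart) step. apply_H applies the
    # scaled memoryless BFGS matrix built from the restart pair (s_r, y_r),
    # so v and w are that matrix applied to g_{k+1} and y_k; d_{k+1} then
    # performs one more BFGS-type update with the current pair (s_k, y_k).
    def apply_H(u):
        ysr = np.dot(y_r, s_r)
        us = np.dot(u, s_r)
        uy = np.dot(u, y_r)
        return (theta_r * u
                - theta_r * (us / ysr) * y_r
                + ((1.0 + theta_r * np.dot(y_r, y_r) / ysr) * us / ysr
                   - theta_r * uy / ysr) * s_r)
    v = apply_H(g_next)          # equation (15)
    w = apply_H(y)               # equation (16)
    ys = np.dot(y, s)
    gs = np.dot(g_next, s)
    # Equation (17): double update scheme.
    return (-v + (gs * w + np.dot(g_next, w) * s) / ys
            - (1.0 + np.dot(y, w) / ys) * (gs / ys) * s)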

3. SCALCG algorithm

Having in view the above developments and the definitions of $g_k$, $s_k$ and $y_k$, as well as the selection procedure for the $\theta_{k+1}$ computation, the following SCALCG algorithm can be presented.

Step 1. Initialization. Select $x_0 \in \mathbb{R}^n$ and the parameters $0 < \sigma_1 \le \sigma_2 < 1$. Compute $f(x_0)$ and $g_0 = \nabla f(x_0)$. Set $d_0 = -g_0$ and $\alpha_0 = 1/\|g_0\|$. Set $k = 0$.
Step 2. Line search. Compute $\alpha_k$ satisfying the Wolfe conditions (11) and (12). Update the variables: $x_{k+1} = x_k + \alpha_k d_k$. Compute $f(x_{k+1})$, $g_{k+1}$ and $s_k = x_{k+1} - x_k$, $y_k = g_{k+1} - g_k$.
Step 3. Test for continuation of iterations. If this test is satisfied, the iterations are stopped; else set $k = k + 1$.
Step 4. Scaling factor computation. Compute $\theta_k$ using (18).
Step 5. Restart direction. Compute the (restart) direction $d_k$ as in (9).
Step 6. Line search. Compute the initial guess $\alpha_k = \alpha_{k-1} \|d_{k-1}\|_2 / \|d_k\|_2$. Using this initialization, compute $\alpha_k$ satisfying the Wolfe conditions. Update the variables: $x_{k+1} = x_k + \alpha_k d_k$. Compute $f(x_{k+1})$, $g_{k+1}$ and $s_k = x_{k+1} - x_k$, $y_k = g_{k+1} - g_k$.
Step 7. Store $\theta = \theta_k$, $s = s_k$ and $y = y_k$.
Step 8. Test for continuation of iterations. If this test is satisfied, the iterations are stopped; else set $k = k + 1$.
Step 9. Restart. If the Powell restart criterion (14) is satisfied, then go to step 4 (a restart step); otherwise, continue with step 10 (a standard step).
Step 10. Standard direction. Compute the direction $d_k$ as in (17), where $v$ and $w$ are computed as in (15) and (16) with the saved values $\theta$, $s$ and $y$.
Step 11. Line search. Compute the initial guess $\alpha_k = \alpha_{k-1} \|d_{k-1}\|_2 / \|d_k\|_2$. Using this initialization, compute $\alpha_k$ satisfying the Wolfe conditions. Update the variables: $x_{k+1} = x_k + \alpha_k d_k$. Compute $f(x_{k+1})$, $g_{k+1}$ and $s_k = x_{k+1} - x_k$, $y_k = g_{k+1} - g_k$.
Step 12. Test for continuation of iterations. If this test is satisfied, the iterations are stopped; else set $k = k + 1$ and go to step 9.

It is well known that if $f$ is bounded below along the direction $d_k$, then there exists a step length $\alpha_k$ satisfying the Wolfe conditions. The initial selection of the step length crucially affects the practical behaviour of the algorithm. At every iteration $k \ge 1$, the starting guess for the step $\alpha_k$ in the line search is computed as $\alpha_{k-1} \|d_{k-1}\|_2 / \|d_k\|_2$. This procedure was considered for the first time by Shanno and Phua [19] in CONMIN. The same one is taken by Birgin and Martínez [3] in SCG.

Concerning the stopping criterion to be used in steps 3, 8 and 12, consider

    \|g_k\|_\infty \le \varepsilon_g,                                                                                              (19)

where $\|\cdot\|_\infty$ denotes the maximum absolute component of a vector and $\varepsilon_g$ is a tolerance specified by the user.
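Putting steps 1-12 together, the following driver is a rough sketch of the iteration. It reuses the helper functions from the sketches above (theta_spectral, restart_direction, double_update_direction, powell_restart) and scipy.optimize.line_search, which enforces the (strong) Wolfe conditions and therefore (11)-(12); the bookkeeping of the restart pair is simplified with respect to steps 4-12, and the initial-step heuristic of steps 6 and 11 is used only as a fallback because scipy's routine does not accept a starting guess. None of the names below come from the paper's Fortran implementation.

import numpy as np
from scipy.optimize import line_search

def scalcg(f, grad, x0, eps_g=1e-6, max_iter=10000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                          # Step 1: d_0 = -g_0
    alpha_prev = 1.0 / np.linalg.norm(g)            # Step 1: alpha_0 = 1 / ||g_0||
    d_norm_prev = np.linalg.norm(d)
    s_r = y_r = None                                # saved restart pair (Step 7)
    theta_r = 1.0
    need_restart = True                             # the first computed direction is a restart one (Step 5)
    for _ in range(max_iter):
        if np.max(np.abs(g)) <= eps_g:              # stopping test (19)
            break
        alpha = line_search(f, grad, x, d, gfk=g, c1=1e-4, c2=0.9)[0]
        if alpha is None:                           # fallback: heuristic of Steps 6 and 11
            alpha = alpha_prev * d_norm_prev / np.linalg.norm(d)
        s = alpha * d                               # s_k = x_{k+1} - x_k
        x = x + s
        g_new = grad(x)
        y = g_new - g                               # y_k = g_{k+1} - g_k
        theta = theta_spectral(s, y)                # scaling factor (18)
        alpha_prev, d_norm_prev = alpha, np.linalg.norm(d)
        if need_restart or powell_restart(g_new, g):
            s_r, y_r, theta_r = s, y, theta         # store the restart information
            d = restart_direction(g_new, s, y, theta)                       # direction (9)
            need_restart = False
        else:
            d = double_update_direction(g_new, s, y, s_r, y_r, theta_r)     # direction (17)
        g = g_new
    return x

With the quadratic example sketched earlier, a call such as scalcg(quad_f, quad_grad, np.zeros(n)) should drive the infinity norm of the gradient below eps_g in a moderate number of iterations; this is meant as a reading aid for the algorithm, not as a reproduction of the Fortran SCALCG code.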

4. Convergence analysis for strongly convex functions

Throughout this section, it is assumed that $f$ is strongly convex and Lipschitz continuous on the level set

    L_0 = \{ x \in \mathbb{R}^n : f(x) \le f(x_0) \}.                                                                              (20)

That is, there exist constants $\mu > 0$ and $L$ such that

    (\nabla f(x) - \nabla f(y))^\top (x - y) \ge \mu \|x - y\|^2                                                                   (21)

and

    \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|,                                                                                 (22)

for all $x$ and $y$ from $L_0$. For the convenience of the reader, the following lemma [12] is included here.

LEMMA 1  Assume that $d_k$ is a descent direction and $\nabla f$ satisfies the Lipschitz condition

    \|\nabla f(x) - \nabla f(x_k)\| \le L \|x - x_k\|                                                                              (23)

for every $x$ on the line segment connecting $x_k$ and $x_{k+1}$, where $L$ is a constant. If the line search satisfies the second Wolfe condition (12), then

    \alpha_k \ge \frac{1 - \sigma_2}{L} \frac{|g_k^\top d_k|}{\|d_k\|^2}.                                                          (24)

Proof  Subtracting $g_k^\top d_k$ from both sides of (12) and using the Lipschitz condition,

    (\sigma_2 - 1) g_k^\top d_k \le (g_{k+1} - g_k)^\top d_k \le L \alpha_k \|d_k\|^2.                                             (25)

Because $d_k$ is a descent direction and $\sigma_2 < 1$, (24) follows immediately from (25).

LEMMA 2  Assume that $f$ is strongly convex and Lipschitz continuous on $L_0$. If $\theta_{k+1}$ is selected by the spectral gradient, then the direction $d_{k+1}$ given by (9) satisfies

    \|d_{k+1}\| \le \left( \frac{2}{\mu} + \frac{2L}{\mu^2} + \frac{L^2}{\mu^3} \right) \|g_{k+1}\|.                               (26)

Proof  By Lipschitz continuity (22),

    \|y_k\| = \|g_{k+1} - g_k\| = \|\nabla f(x_k + \alpha_k d_k) - \nabla f(x_k)\| \le L \alpha_k \|d_k\| = L \|s_k\|.             (27)

In contrast, by strong convexity (21),

    y_k^\top s_k \ge \mu \|s_k\|^2.                                                                                                (28)

Selecting $\theta_{k+1}$ as in (18), it follows that

    \theta_{k+1} = \frac{s_k^\top s_k}{y_k^\top s_k} \le \frac{\|s_k\|^2}{\mu \|s_k\|^2} = \frac{1}{\mu}.                          (29)

Now, using the triangle inequality and the estimates (27)-(29), after some algebra on $d_{k+1}$, where $d_{k+1}$ is given by (9), (26) is obtained.

The convergence of the SCALCG algorithm when $f$ is strongly convex is given by the following theorem.

THEOREM 2  Assume that $f$ is strongly convex and Lipschitz continuous on the level set $L_0$. If at every step of the conjugate gradient method (2) the direction $d_{k+1}$ is given by (9) and the step length $\alpha_k$ is selected to satisfy the Wolfe conditions (11) and (12), then either $g_k = 0$ for some $k$ or $\lim_{k \to \infty} g_k = 0$.

Proof  Suppose $g_k \ne 0$ for all $k$. By strong convexity,

    y_k^\top d_k = (g_{k+1} - g_k)^\top d_k \ge \mu \alpha_k \|d_k\|^2.                                                            (30)

By Theorem 1, $g_k^\top d_k < 0$. Therefore, the assumption $g_k \ne 0$ implies $d_k \ne 0$. As $\alpha_k > 0$, from (30) it follows that $y_k^\top d_k > 0$. But $f$ is strongly convex over $L_0$, therefore $f$ is bounded from below.

Now, summing over $k$ the first Wolfe condition (11),

    \sum_{k \ge 0} \alpha_k g_k^\top d_k > -\infty.

Considering the lower bound for $\alpha_k$ given by (24) in Lemma 1 and having in view that $d_k$ is a descent direction, it follows that

    \sum_{k \ge 1} \frac{(g_k^\top d_k)^2}{\|d_k\|^2} < \infty.                                                                    (31)

Now, from (13), using the inequality of Cauchy and (28),

    g_{k+1}^\top d_{k+1} \le -\frac{(g_{k+1}^\top s_k)^2}{y_k^\top s_k} \le -\frac{\|g_{k+1}\|^2 \|s_k\|^2}{\mu \|s_k\|^2} = -\frac{\|g_{k+1}\|^2}{\mu}.

Therefore, from (31), it follows that

    \sum_{k \ge 0} \frac{\|g_k\|^4}{\|d_k\|^2} < \infty.                                                                           (32)

Now, inserting the upper bound (26) for $\|d_k\|$ in (32) yields

    \sum_{k \ge 0} \|g_k\|^2 < \infty,

which completes the proof.

For general functions, the convergence of the algorithm comes from Theorem 1 and the restart procedure. Therefore, for strongly convex functions and under inexact line search, it is globally convergent. To a great extent, however, the SCALCG algorithm is very close to the Perry/Shanno computational scheme [8,9]. SCALCG is a scaled memoryless BFGS preconditioned algorithm where the scaling factor is the inverse of a scalar approximation of the Hessian. If the Powell restart criterion (14) is used for general functions $f$ bounded from below with bounded second partial derivatives and bounded level set, then, using the same arguments considered by Shanno [9], it is possible to prove that either the iterates converge to a point $x^*$ satisfying $\|g(x^*)\| = 0$ or the iterates cycle. It remains for further study to determine a complete global convergence result and whether cycling can occur for general functions with bounded second partial derivatives and bounded level set. More sophisticated reasons for restarting the algorithms have been proposed in the literature, but here the interest is in the performance of an algorithm that uses the Powell restart criterion, associated with the scaled memoryless BFGS preconditioned direction choice at restart. Additionally, some convergence analysis with the Powell restart criterion was given by Dai and Yuan [20] and can be used in this context of the preconditioned and scaled memoryless BFGS algorithm.

5. Computational results and comparisons

In this section, the performance of a Fortran implementation of SCALCG on a set of 750 unconstrained optimization test problems is presented.

At the same time, the performance of SCALCG is compared with the best SCG algorithm (betatype = 1, Perry-M1) by Birgin and Martínez [3], with CG_DESCENT by Hager and Zhang [12,13] and with the Polak-Ribière (PR) algorithm. The SCALCG code is authored by Andrei, whereas SCG and PR are co-authored by Birgin and Martínez and CG_DESCENT is co-authored by Hager and Zhang. All codes are written in Fortran and compiled with f77 (default compiler settings) on an Intel Pentium 4, 1.5 GHz. The CG_DESCENT code contains the variant implementing the Wolfe line search (W) and the variant corresponding to the approximate Wolfe conditions (aW). All algorithms implement the same stopping criterion as in (19), where $\varepsilon_g = 10^{-6}$.

The test problems are the unconstrained problems in the CUTE [11] collection, along with other large-scale optimization problems. Seventy-five large-scale unconstrained optimization problems in extended or generalized form are selected. For each function, 10 numerical experiments with number of variables $n = 1000, 2000, \ldots, 10{,}000$ have been considered.

Let $f_i^{ALG1}$ and $f_i^{ALG2}$ be the optimal value found by ALG1 and ALG2 for problem $i = 1, \ldots, 750$, respectively. It is said that, in the particular problem $i$, the performance of ALG1 was better than the performance of ALG2 if $|f_i^{ALG1} - f_i^{ALG2}| < 10^{-3}$ and the number of iterations, or the number of function-gradient evaluations, or the CPU time of ALG1 was less than the number of iterations, the number of function-gradient evaluations or the CPU time corresponding to ALG2, respectively.

The numerical results concerning the number of iterations, the number of restart iterations, the number of function and gradient evaluations, and the CPU time in seconds for each of these methods can be found in ref. [21]. Tables 1-4 show the number of problems, out of 750, for which SCALCG versus SCG, PR, CG_DESCENT(W) or CG_DESCENT(aW) achieved the minimum number of iterations (# iter), the minimum number of function evaluations (# fg) and the minimum CPU time (CPU), respectively. For example, when comparing SCALCG and SCG (table 1), subject to the number of iterations, SCALCG was better in 486 problems (i.e. it achieved the minimum number of iterations in 486 problems), SCG was better in 98 problems, they had the same number of iterations in 89 problems, etc. From these tables, it is seen that, at least for this set of 750 problems, the top performer is SCALCG.
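The counting rule used above can be stated compactly; the sketch below is only an illustration of that rule (the function name and arguments are not from the paper):

def alg1_better(f1, f2, metric1, metric2, tol=1e-3):
    # ALG1 counts as better than ALG2 on a problem when both reach essentially
    # the same optimum, |f_i^ALG1 - f_i^ALG2| < 10^-3, and ALG1 needs a smaller
    # value of the chosen metric (iterations, function-gradient evaluations, or CPU time).
    return abs(f1 - f2) < tol and metric1 < metric2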

Table 1. Performance of SCALCG versus SCG (750 problems).

            SCALCG    SCG      =
   # iter      486     98     89
   # fg        425    128    120
   CPU         592     43     38

Table 2. Performance of SCALCG versus PR (750 problems).

            SCALCG     PR      =
   # iter      497     93     81
   # fg        415    134    122
   CPU         580     54     37

Table 3. Performance of SCALCG versus CG_DESCENT(W) (750 problems).

            SCALCG    CG_DESCENT(W)      =
   # iter      440              173     28
   # fg        444              164     33
   CPU         515               91     35

Table 4. Performance of SCALCG versus CG_DESCENT(aW) (750 problems).

            SCALCG    CG_DESCENT(aW)     =
   # iter      445              168     28
   # fg        431              175     35
   CPU         507               97     37

As these codes use the same Wolfe line search, excepting CG_DESCENT(aW) which implements an approximate Wolfe line search, and the same stopping criterion, they differ in their choice of the search direction. Hence, among these conjugate gradient algorithms, SCALCG appears to generate the best search direction, on average.

6. Conclusion

A scaled memoryless BFGS preconditioned conjugate gradient algorithm, the SCALCG algorithm, for solving large-scale unconstrained optimization problems is presented. SCALCG can be considered as a modification of the best algorithm by Birgin and Martínez [3], which is mainly a scaled variant of Perry's [7], and of that of Dai and Liao [4] ($t = 1$), in order to overcome the lack of positive definiteness of the matrix defining the search direction. This modification takes advantage of the quasi-Newton BFGS updating formula. Using the Beale-Powell restart technology, a SCALCG algorithm is obtained in which the parameter scaling the gradient is selected as the spectral gradient. Although the update formulas (9) and (15)-(17) are more complicated, the scheme proved to be efficient and robust in numerical experiments. The algorithm implements the Wolfe conditions, and it has been proved that its steps are along descent directions.

The performance of the SCALCG algorithm was higher than that of the SCG method by Birgin and Martínez, CG_DESCENT by Hager and Zhang and the scaled PR method for a set of 750 unconstrained optimization problems with dimensions ranging between $10^3$ and $10^4$. As each of the codes considered here is different, mainly in the amount of linear algebra required at each step, it is quite clear that different codes will be superior on different problem sets. Generally, one needs to solve a thousand different problems before trends begin to emerge. For any particular problem, almost any method can win. However, there is strong computational evidence that the scaled memoryless BFGS preconditioned conjugate gradient algorithm is the top performer among these conjugate gradient algorithms.

Acknowledgement

The author was awarded the Romanian Academy Grant 168/2003.

References

[1] Polak, E. and Ribière, G., 1969, Note sur la convergence de méthodes de directions conjuguées. Revue Française Informat. Recherche Opérationnelle, 16, 35-43.
[2] Fletcher, R. and Reeves, C.M., 1964, Function minimization by conjugate gradients. The Computer Journal, 7, 149-154.
[3] Birgin, E. and Martínez, J.M., 2001, A spectral conjugate gradient method for unconstrained optimization. Applied Mathematics and Optimization, 43, 117-128.
[4] Dai, Y.H. and Liao, L.Z., 2001, New conjugacy conditions and related nonlinear conjugate gradient methods. Applied Mathematics and Optimization, 43, 87-101.
[5] Oren, S.S. and Luenberger, D.G., 1976, Self-scaling variable metric algorithm. Part I. Management Science, 20, 845-862.
[6] Oren, S.S. and Spedicato, E., 1976, Optimal conditioning of self-scaling variable metric algorithms. Mathematical Programming, 10, 70-90.
[7] Perry, J.M., 1977, A class of conjugate gradient algorithms with a two-step variable metric memory. Discussion Paper 269, Center for Mathematical Studies in Economics and Management Science, Northwestern University.
[8] Shanno, D.F., 1978, Conjugate gradient methods with inexact searches. Mathematics of Operations Research, 3, 244-256.
[9] Shanno, D.F., 1978, On the convergence of a new conjugate gradient algorithm. SIAM Journal on Numerical Analysis, 15, 1247-1257.
[10] Raydan, M., 1997, The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7, 26-33.
[11] Bongartz, I., Conn, A.R., Gould, N.I.M. and Toint, P.L., 1995, CUTE: constrained and unconstrained testing environment. ACM Transactions on Mathematical Software, 21, 123-160.
[12] Hager, W.W. and Zhang, H., 2005, A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM Journal on Optimization, 16, 170-192.
[13] Hager, W.W. and Zhang, H., 2006, Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software, 32, 113-137.
[14] Hestenes, M.R. and Stiefel, E., 1952, Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, Section B, 48, 409-436.
[15] Wolfe, P., 1969, Convergence conditions for ascent methods. SIAM Review, 11, 226-235.
[16] Wolfe, P., 1971, Convergence conditions for ascent methods II: some corrections. SIAM Review, 13, 185-188.
[17] Powell, M.J.D., 1976, Some convergence properties of the conjugate gradient method. Mathematical Programming, 11, 42-49.
[18] Powell, M.J.D., 1977, Restart procedures for the conjugate gradient method. Mathematical Programming, 12, 241-254.
[19] Shanno, D.F. and Phua, K.H., 1976, Algorithm 500: minimization of unconstrained multivariate functions. ACM Transactions on Mathematical Software, 2, 87-94.
[20] Dai, Y.H. and Yuan, Y., 1998, Convergence properties of the Beale-Powell restart algorithm. Science in China (Series A), 41(11), 1142-1150.
[21] Andrei, N. Available online at: http://www.ici.ro/camo/neculai/anpaper.htm/