Math 164: Optimization
Barzilai-Borwein Method

Instructor: Wotao Yin
Department of Mathematics, UCLA
Spring 2015
Online discussions on piazza.com
Main features of the Barzilai-Borwein (BB) method

- The BB method was published in an 8-page paper¹ in 1988
- It is a gradient method with modified step sizes, which are motivated by Newton's method but do not involve any Hessian
- At nearly no extra cost, the method often significantly improves the performance of a standard gradient method
- The method is used along with nonmonotone line search as a safeguard

¹ J. Barzilai and J.M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988.
Motivation of the BB method

Let $g^{(k)} = \nabla f(x^{(k)})$ and $F^{(k)} = \nabla^2 f(x^{(k)})$.

Gradient method: $x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$
- choice of $\alpha_k$: fixed, exact line search, or fixed initial + line search
- pros: simple
- cons: no use of 2nd-order information, sometimes zig-zags

Newton's method: $x^{(k+1)} = x^{(k)} - (F^{(k)})^{-1} g^{(k)}$
- pros: 2nd-order information, one step for quadratic functions, fast convergence near the solution
- cons: forming and computing $(F^{(k)})^{-1}$ is expensive; needs modifications if $F^{(k)} \not\succ 0$

The BB method chooses $\alpha_k$ so that $\alpha_k g^{(k)}$ approximates $(F^{(k)})^{-1} g^{(k)}$ without computing $F^{(k)}$.
Derive the BB method

Consider
$$\operatorname*{minimize}_x \; f(x) = \tfrac{1}{2} x^T A x - b^T x,$$
where $A \succ 0$ is symmetric. Gradient is $g^{(k)} = Ax^{(k)} - b$. Hessian is $A$.

Newton step: $d^{(k)}_{\text{newton}} = -A^{-1} g^{(k)}$

Goal: choose $\alpha_k$ so that $\alpha_k g^{(k)} = (\alpha_k^{-1} I)^{-1} g^{(k)}$ approximates $A^{-1} g^{(k)}$.

Define $s^{(k-1)} := x^{(k)} - x^{(k-1)}$ and $y^{(k-1)} := g^{(k)} - g^{(k-1)}$. Then $A$ satisfies
$$A s^{(k-1)} = y^{(k-1)}.$$
Therefore, given $s^{(k-1)}$ and $y^{(k-1)}$, how about choosing $\alpha_k$ so that
$$(\alpha_k^{-1} I)\, s^{(k-1)} \approx y^{(k-1)}?$$
Goal: $(\alpha_k^{-1} I)\, s^{(k-1)} \approx y^{(k-1)}$.

BB method. Least-squares problem (let $\beta = \alpha^{-1}$):
$$\alpha_k^{-1} = \arg\min_\beta \tfrac{1}{2} \big\| s^{(k-1)} \beta - y^{(k-1)} \big\|^2 \;\Longrightarrow\; \alpha_k^1 = \frac{(s^{(k-1)})^T s^{(k-1)}}{(s^{(k-1)})^T y^{(k-1)}}$$

Alternative least-squares problem:
$$\alpha_k = \arg\min_\alpha \tfrac{1}{2} \big\| s^{(k-1)} - y^{(k-1)} \alpha \big\|^2 \;\Longrightarrow\; \alpha_k^2 = \frac{(s^{(k-1)})^T y^{(k-1)}}{(y^{(k-1)})^T y^{(k-1)}}$$

$\alpha_k^1$ and $\alpha_k^2$ are called the BB step sizes.
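For concreteness, here is a minimal NumPy sketch computing both BB step sizes from the difference vectors; the function name bb_step_sizes is my own, not from the lecture:

    import numpy as np

    def bb_step_sizes(s, y):
        # s = x^(k) - x^(k-1), y = g^(k) - g^(k-1)
        # BB1 solves min_beta 0.5*||s*beta - y||^2 and sets alpha = 1/beta;
        # BB2 solves min_alpha 0.5*||s - y*alpha||^2 directly.
        sy = s @ y
        alpha1 = (s @ s) / sy  # alpha_k^1 = s^T s / s^T y
        alpha2 = sy / (y @ y)  # alpha_k^2 = s^T y / y^T y
        return alpha1, alpha2

Both formulas require $(s^{(k-1)})^T y^{(k-1)} > 0$ to give a positive step size; for the strictly convex quadratic above this always holds, since $s^T y = s^T A s > 0$.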
Apply the BB method

- Since $x^{(k-1)}$ and $g^{(k-1)}$, and thus $s^{(k-1)}$ and $y^{(k-1)}$, are unavailable at $k = 0$, we apply the standard gradient descent at $k = 0$ and start BB at $k = 1$ (see the sketch below)
- We can use either $\alpha_k^1$ or $\alpha_k^2$, or alternate between them
- We can fix $\alpha_k = \alpha_k^1$ or $\alpha_k = \alpha_k^2$ for a few consecutive steps
- It performs very well on minimizing quadratic and many other functions
- However, the sequences $f_k$ and $\|\nabla f_k\|$ are not monotonic!
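A sketch of the resulting iteration, assuming a gradient oracle grad and a fixed step size alpha0 for the very first iteration (both names, and the choice to alternate the two BB formulas, are mine):

    import numpy as np

    def bb_gradient_method(grad, x0, alpha0=1e-3, tol=1e-8, max_iter=1000):
        x = x0.copy()
        g = grad(x)
        x_new = x - alpha0 * g              # standard gradient step at k = 0
        for k in range(1, max_iter):
            g_new = grad(x_new)
            if np.linalg.norm(g_new) < tol:
                break
            s, y = x_new - x, g_new - g
            if k % 2 == 1:                  # alternate alpha_k^1 and alpha_k^2
                alpha = (s @ s) / (s @ y)
            else:
                alpha = (s @ y) / (y @ y)
            x, g = x_new, g_new
            x_new = x - alpha * g           # BB step: no line search here
        return x_new

On non-quadratic problems this bare loop should be safeguarded by the nonmonotone line search discussed later.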
Steepest descent versus BB on quadratic programming

Model:
$$\operatorname*{minimize}_x \; f(x) := \tfrac{1}{2} x^T A x - b^T x.$$

Gradient iteration:
$$x^{(k+1)} = x^{(k)} - \alpha_k (A x^{(k)} - b).$$

Steepest descent selects $\alpha_k$ as $\arg\min_\alpha f\big(x^{(k)} - \alpha (A x^{(k)} - b)\big)$:
$$\alpha_k = \frac{(r^{(k)})^T r^{(k)}}{(r^{(k)})^T A\, r^{(k)}}, \quad \text{where } r^{(k)} := b - A x^{(k)}.$$

BB selects $\alpha_k$ as
$$\alpha_k^1 = \frac{(s^{(k-1)})^T s^{(k-1)}}{(s^{(k-1)})^T y^{(k-1)}}.$$
Numerical example

- Set symmetric matrix $A$ to have the condition number $\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} = 50$
- Stopping criterion: $\|r^{(k)}\| < 10^{-8}$
- Steepest descent stops in 90 iterations
- BB stops in 10 iterations

[Figure: contour plot of $f$ with the iterates of gradient descent (90 steps) and Barzilai-Borwein (10 steps)]
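A sketch that reproduces this type of experiment; the random instance below (size, seed, eigenvalue spacing) is my own choice, so the exact iteration counts will differ from the figure:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    # symmetric positive definite A with condition number 50
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = Q @ np.diag(np.linspace(1.0, 50.0, n)) @ Q.T
    b = rng.standard_normal(n)

    def steepest_descent(x, tol=1e-8):
        for k in range(10**5):
            r = b - A @ x                       # residual = -gradient
            if np.linalg.norm(r) < tol:
                return k
            x = x + (r @ r) / (r @ A @ r) * r   # exact line search step

    def bb(x, tol=1e-8):
        g = A @ x - b
        x_new = x - 1e-3 * g                    # plain gradient step at k = 0
        for k in range(1, 10**5):
            g_new = A @ x_new - b
            if np.linalg.norm(g_new) < tol:
                return k
            s, y = x_new - x, g_new - g
            x, g = x_new, g_new
            x_new = x - (s @ s) / (s @ y) * g   # BB1 step

    x0 = np.zeros(n)
    print("steepest descent:", steepest_descent(x0.copy()), "iterations")
    print("BB:", bb(x0.copy()), "iterations")

In practice, the gap between the two methods typically widens as the condition number of $A$ grows.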
Properties of Barzilai-Borwein

- For quadratic functions, it has R-linear convergence²
- For 2D quadratic functions, it has Q-superlinear convergence³
- No convergence guarantee for smooth convex problems. On these problems, we pair up BB with nonmonotone line search.

[Figure: semilog plot of $f - f_{\min}$ versus iteration number for BB on Laplace2: $\min \frac{1}{2} x^T A x - b^T x + \frac{h^2}{4} \sum_{ijk} u_{ijk}^4$]

² Dai and Liao [2002]
³ Barzilai and Borwein [1988], Dai [2013]
Nonmonotone line search

- Some growth in the function value is permitted
- It sometimes improves the likelihood of finding a global optimum
- It improves convergence speed when a monotone scheme is forced to creep along the bottom of a narrow curved valley

An early nonmonotone line search method⁴ was developed for Newton's method:
$$f(x^{(k)} + \alpha d^{(k)}) \le \max_{0 \le j \le m_k} f(x^{(k-j)}) + c_1 \alpha \nabla f_k^T d^{(k)}$$

However, it may still kill R-linear convergence. Example: $x \in \mathbb{R}$, minimize $f(x) = \frac{1}{2} x^2$, $x_0 \neq 0$, $d^{(k)} = -x^{(k)}$,
$$\alpha_k = \begin{cases} 1 - 2^{-k}, & k = i^2 \text{ for some integer } i, \\ 2, & \text{otherwise}, \end{cases}$$
converges R-linearly but fails to satisfy the condition for $k$ large.

⁴ Grippo, Lampariello, and Lucidi [1986]
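For concreteness, a minimal sketch of this max-type backtracking rule; the window f_hist, the backtracking factor, and all parameter values are my choices:

    import numpy as np

    def gll_backtracking(f, x, g, d, f_hist, c1=1e-4, shrink=0.5, alpha=1.0):
        # f_hist holds f(x^(k-j)) for j = 0, ..., m_k (the memory window)
        f_ref = max(f_hist)   # reference: worst recent function value
        gtd = g @ d           # directional derivative, assumed < 0
        while f(x + alpha * d) > f_ref + c1 * alpha * gtd:
            alpha *= shrink   # backtrack until the condition holds
        return alpha

    # usage on f(x) = 0.5*||x||^2 with the steepest descent direction
    f = lambda x: 0.5 * (x @ x)
    x = np.array([2.0, -1.0])
    g = x                     # gradient of f at x
    alpha = gll_backtracking(f, x, g, -g, f_hist=[f(x)])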
Zhang-Hager nonmonotone line search⁵

1. initialize $0 < c_1 < c_2 < 1$, $C_0 \leftarrow f(x_0)$, $Q_0 \leftarrow 1$, $\eta < 1$, $k \leftarrow 0$
2. while not converged do
3a. compute $\alpha_k$ satisfying the modified Wolfe conditions, OR
3b. find $\alpha_k$ by backtracking, to satisfy the modified Armijo condition (sufficient decrease): $f(x^{(k)} + \alpha_k d^{(k)}) \le C_k + c_1 \alpha_k \nabla f_k^T d^{(k)}$
4. $x^{(k+1)} \leftarrow x^{(k)} + \alpha_k d^{(k)}$
5. $Q_{k+1} \leftarrow \eta Q_k + 1$, $C_{k+1} \leftarrow (\eta Q_k C_k + f(x^{(k+1)}))/Q_{k+1}$

Comments: if $\eta = 1$, then $C_k = \frac{1}{k+1} \sum_{j=0}^{k} f_j$, the average of all past function values. With $\eta < 1$, $C_k$ is a weighted average of all past $f_j$, with more weight on recent $f_j$.

⁵ Zhang and Hager [2004]
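A sketch of the Armijo variant (steps 3b-5) in NumPy; the backtracking loop, the direction callback, and the default parameter values are my own choices:

    import numpy as np

    def zhang_hager(f, grad, x, direction, c1=1e-4, eta=0.85,
                    shrink=0.5, tol=1e-8, max_iter=1000):
        C, Q = f(x), 1.0                        # C_0 = f(x_0), Q_0 = 1
        for k in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            d = direction(x, g, k)              # descent direction, g^T d < 0
            alpha, gtd = 1.0, g @ d
            # modified Armijo: f(x + alpha*d) <= C_k + c1*alpha*g^T d
            while f(x + alpha * d) > C + c1 * alpha * gtd:
                alpha *= shrink
            x = x + alpha * d
            Q_new = eta * Q + 1.0               # Q_{k+1} = eta*Q_k + 1
            C = (eta * Q * C + f(x)) / Q_new    # C_{k+1}: weighted average
            Q = Q_new
        return x

To pair this with BB, take $d = -g^{(k)}$ and start the backtracking at the BB step size $\alpha_k^1$ or $\alpha_k^2$ instead of at 1.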
Convergence (advanced topic)

The results below are left to the reader as an exercise.

- If $f \in C^1$ and bounded below, and $\nabla f_k^T d^{(k)} < 0$, then $f_k \le C_k \le \frac{1}{k+1} \sum_{j=0}^{k} f_j$, and there exists $\alpha_k$ satisfying the modified Wolfe or Armijo conditions
- In addition, if $\nabla f$ is Lipschitz with constant $L$, then $\alpha_k > C \frac{|\nabla f_k^T d^{(k)}|}{\|d^{(k)}\|^2}$ for some constant $C$ depending on $c_1$, $c_2$, $L$, and the backtracking factor
- Furthermore, if for all sufficiently large $k$ we have the uniform bounds $\nabla f_k^T d^{(k)} \le -c_3 \|\nabla f_k\|^2$ and $\|d^{(k)}\| \le c_4 \|\nabla f_k\|$, then $\lim_{k \to \infty} \|\nabla f_k\| = 0$

Once again, paired with nonmonotone line search, Barzilai-Borwein gradient methods work very well on general unconstrained differentiable problems.
References:

- Yu-Hong Dai and Li-Zhi Liao. R-linear convergence of the Barzilai and Borwein gradient method. IMA Journal of Numerical Analysis, 22(1):1-10, 2002.
- J. Barzilai and J.M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988.
- Yu-Hong Dai. A new analysis on the Barzilai-Borwein gradient method. Journal of the Operations Research Society of China, pages 1-12, 2013.
- Luigi Grippo, Francesco Lampariello, and Stefano Lucidi. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23(4):707-716, 1986.
- Hongchao Zhang and William W. Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM Journal on Optimization, 14(4):1043-1056, 2004.