Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)


MSc course on nonlinear optimization

UNCONSTRAINED MINIMIZATION

$\min_{x \in \mathbb{R}^n} f(x)$

where the objective function $f : \mathbb{R}^n \to \mathbb{R}$:
- assume that $f \in C^1$ (sometimes $C^2$) and Lipschitz continuous
- in practice this assumption is often violated, but it is not necessary

LINESEARCH VS TRUST-REGION METHODS

Linesearch methods:
- pick a descent direction $p_k$
- pick a stepsize $\alpha_k$ to reduce $f(x_k + \alpha p_k)$
- set $x_{k+1} = x_k + \alpha_k p_k$

Trust-region methods:
- pick a step $s_k$ to reduce a model of $f(x_k + s)$
- accept $x_{k+1} = x_k + s_k$ if the decrease in the model is inherited by $f(x_k + s_k)$
- otherwise set $x_{k+1} = x_k$ and refine the model

TRUST-REGION MODEL PROBLEM

Model $f(x_k + s)$ by:
- a linear model $m_k^L(s) = f_k + s^T g_k$
- a quadratic model (symmetric $B_k$) $m_k^Q(s) = f_k + s^T g_k + \tfrac{1}{2} s^T B_k s$

Major difficulties:
- the models may not resemble $f(x_k + s)$ if $s$ is large
- the models may be unbounded from below:
  the linear model always, unless $g_k = 0$;
  the quadratic model always if $B_k$ is indefinite, and possibly if $B_k$ is only positive semi-definite

THE TRUST REGION

Prevent the model $m_k(s)$ from unboundedness by imposing a trust-region constraint
$\|s\| \leq \Delta_k$
for some suitable scalar radius $\Delta_k > 0$ ⟹ the trust-region subproblem

approximately minimize $m_k(s)$ over $s \in \mathbb{R}^n$ subject to $\|s\| \leq \Delta_k$

- in theory the method does not depend on the norm
- in practice it might!

OUR MODEL

For simplicity, concentrate on the second-order (Newton-like) model
$m_k(s) = m_k^Q(s) = f_k + s^T g_k + \tfrac{1}{2} s^T B_k s$
and the $\ell_2$ trust-region norm $\|\cdot\| = \|\cdot\|_2$.
Note:
- $B_k = H_k$ is allowed
- the analysis for other trust-region norms simply adds extra constants in the following results

BASIC TRUST-REGION METHOD

Given $k = 0$, $\Delta_0 > 0$ and $x_0$, until convergence do:
- Build the second-order model $m_k(s)$ of $f(x_k + s)$.
- Solve the trust-region subproblem to find $s_k$ for which $m_k(s_k) < f_k$ and $\|s_k\| \leq \Delta_k$, and define
  $\rho_k = \dfrac{f_k - f(x_k + s_k)}{f_k - m_k(s_k)}.$
- If $\rho_k \geq \eta_v$ [very successful, $0 < \eta_v < 1$], set $x_{k+1} = x_k + s_k$ and $\Delta_{k+1} = \gamma_i \Delta_k$ [$\gamma_i \geq 1$].
- Otherwise, if $\rho_k \geq \eta_s$ [successful, $0 < \eta_s \leq \eta_v < 1$], set $x_{k+1} = x_k + s_k$ and $\Delta_{k+1} = \Delta_k$.
- Otherwise [unsuccessful], set $x_{k+1} = x_k$ and $\Delta_{k+1} = \gamma_d \Delta_k$ [$0 < \gamma_d < 1$].
- Increase $k$ by 1.

SOLVE THE TRUST-REGION SUBPROBLEM?

At the very least, aim to achieve as much reduction in the model as would an iteration of steepest descent.
Cauchy point: $s_k^C = -\alpha_k^C g_k$, where
$\alpha_k^C = \arg\min_{\alpha > 0} m_k(-\alpha g_k)$ subject to $\alpha \|g_k\| \leq \Delta_k$
$\phantom{\alpha_k^C} = \arg\min_{0 < \alpha \leq \Delta_k/\|g_k\|} m_k(-\alpha g_k)$
- minimize a quadratic on a line segment ⟹ very easy!
- require that $m_k(s_k) \leq m_k(s_k^C)$ and $\|s_k\| \leq \Delta_k$
- in practice, hope to do far better than this (see the sketch below)
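As a concrete illustration, the following Python/NumPy sketch implements the basic method above with the Cauchy point as the (crudest acceptable) approximate subproblem solution. The function names and the particular parameter values ($\eta_s = 0.1$, $\eta_v = 0.9$, $\gamma_i = 2$, $\gamma_d = 0.5$) are illustrative assumptions rather than values prescribed in the lecture.

```python
import numpy as np

def cauchy_point(g, B, Delta):
    """Minimize m(-alpha g) = f - alpha ||g||^2 + 0.5 alpha^2 g'Bg
    over 0 < alpha <= Delta/||g|| and return s = -alpha g."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    curv = g @ B @ g
    alpha_max = Delta / gnorm
    # non-positive curvature: the model decreases all the way to the boundary
    alpha = alpha_max if curv <= 0.0 else min(gnorm**2 / curv, alpha_max)
    return -alpha * g

def basic_trust_region(f, grad, hess, x0, Delta0=1.0, eta_s=0.1, eta_v=0.9,
                       gamma_i=2.0, gamma_d=0.5, tol=1e-8, max_iter=500):
    x, Delta = np.asarray(x0, dtype=float), Delta0
    for _ in range(max_iter):
        fk, gk, Bk = f(x), grad(x), hess(x)
        if np.linalg.norm(gk) <= tol:
            break
        s = cauchy_point(gk, Bk, Delta)                # approximate subproblem solution
        model_decrease = -(gk @ s + 0.5 * s @ Bk @ s)  # f_k - m_k(s_k) > 0
        rho = (fk - f(x + s)) / model_decrease
        if rho >= eta_v:          # very successful: accept and enlarge the radius
            x, Delta = x + s, gamma_i * Delta
        elif rho >= eta_s:        # successful: accept, keep the radius
            x = x + s
        else:                     # unsuccessful: reject and shrink the radius
            Delta = gamma_d * Delta
    return x
```

Replacing `cauchy_point` by any step achieving at least as much model decrease (for instance, the truncated conjugate-gradient solver sketched later) leaves the convergence theory that follows intact.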

ACHIEVABLE MODEL DECREASE

Theorem 3.1. If $m_k(s)$ is the second-order model and $s_k^C$ is its Cauchy point within the trust region $\|s\| \leq \Delta_k$, then
$f_k - m_k(s_k^C) \geq \tfrac{1}{2} \|g_k\| \min\left[ \dfrac{\|g_k\|}{1 + \|B_k\|}, \Delta_k \right].$

PROOF OF THEOREM 3.1

$m_k(-\alpha g_k) = f_k - \alpha \|g_k\|^2 + \tfrac{1}{2} \alpha^2 g_k^T B_k g_k.$
The result is immediate if $g_k = 0$. Otherwise there are 3 possibilities:
(i) the curvature $g_k^T B_k g_k \leq 0$ ⟹ $m_k(-\alpha g_k)$ is unbounded from below as $\alpha$ increases ⟹ the Cauchy point occurs on the trust-region boundary;
(ii) the curvature $g_k^T B_k g_k > 0$ and the minimizer of $m_k(-\alpha g_k)$ occurs at or beyond the trust-region boundary ⟹ the Cauchy point occurs on the trust-region boundary;
(iii) the curvature $g_k^T B_k g_k > 0$ and the minimizer of $m_k(-\alpha g_k)$, and hence the Cauchy point, occurs before the trust-region boundary is reached.
Consider each case in turn.

Case (i): $g_k^T B_k g_k \leq 0$ and $\alpha \geq 0$ ⟹
$m_k(-\alpha g_k) = f_k - \alpha \|g_k\|^2 + \tfrac{1}{2} \alpha^2 g_k^T B_k g_k \leq f_k - \alpha \|g_k\|^2. \qquad (1)$
The Cauchy point lies on the boundary of the trust region ⟹
$\alpha_k^C = \dfrac{\Delta_k}{\|g_k\|}. \qquad (2)$
(1) + (2) ⟹
$f_k - m_k(s_k^C) \geq \|g_k\|^2 \dfrac{\Delta_k}{\|g_k\|} = \|g_k\| \Delta_k \geq \tfrac{1}{2} \|g_k\| \Delta_k.$

Case (ii): the unconstrained minimizer along the steepest-descent direction is
$\alpha_k^* \stackrel{\rm def}{=} \arg\min_{\alpha} \left[ f_k - \alpha \|g_k\|^2 + \tfrac{1}{2} \alpha^2 g_k^T B_k g_k \right] = \dfrac{\|g_k\|^2}{g_k^T B_k g_k}, \qquad (3)$
and since it lies at or beyond the boundary,
$\alpha_k^C = \dfrac{\Delta_k}{\|g_k\|} \leq \alpha_k^* \qquad (4)$
⟹ $\alpha_k^C\, g_k^T B_k g_k \leq \|g_k\|^2. \qquad (5)$
(3) + (4) + (5) ⟹
$f_k - m_k(s_k^C) = \alpha_k^C \|g_k\|^2 - \tfrac{1}{2} [\alpha_k^C]^2 g_k^T B_k g_k \geq \tfrac{1}{2} \alpha_k^C \|g_k\|^2 = \tfrac{1}{2} \|g_k\|^2 \dfrac{\Delta_k}{\|g_k\|} = \tfrac{1}{2} \|g_k\| \Delta_k.$

Case (iii): here $\alpha_k^C = \alpha_k^* = \dfrac{\|g_k\|^2}{g_k^T B_k g_k}$, and so
$f_k - m_k(s_k^C) = \alpha_k^* \|g_k\|^2 - \tfrac{1}{2} (\alpha_k^*)^2 g_k^T B_k g_k = \dfrac{\|g_k\|^4}{g_k^T B_k g_k} - \tfrac{1}{2} \dfrac{\|g_k\|^4}{g_k^T B_k g_k} = \tfrac{1}{2} \dfrac{\|g_k\|^4}{g_k^T B_k g_k} \geq \tfrac{1}{2} \dfrac{\|g_k\|^2}{1 + \|B_k\|},$
because $g_k^T B_k g_k \leq \|B_k\| \|g_k\|^2 \leq (1 + \|B_k\|) \|g_k\|^2$ by the Cauchy-Schwarz inequality.

Corollary 3.2. If $m_k(s)$ is the second-order model, and $s_k$ is an improvement on the Cauchy point within the trust region $\|s\| \leq \Delta_k$, then
$f_k - m_k(s_k) \geq \tfrac{1}{2} \|g_k\| \min\left[ \dfrac{\|g_k\|}{1 + \|B_k\|}, \Delta_k \right].$

DIFFERENCE BETWEEN MODEL AND FUNCTION

Lemma 3.3. Suppose that $f \in C^2$, and that the true and model Hessians satisfy the bounds $\|H(x)\| \leq \kappa_h$ for all $x$ and $\|B_k\| \leq \kappa_b$ for all $k$ and some $\kappa_h \geq 1$ and $\kappa_b \geq 0$. Then
$|f(x_k + s_k) - m_k(s_k)| \leq \kappa_d \Delta_k^2,$
where $\kappa_d = \tfrac{1}{2}(\kappa_h + \kappa_b)$, for all $k$.

PROOF OF LEMMA 3.3

The mean value theorem gives
$f(x_k + s_k) = f(x_k) + s_k^T \nabla_x f(x_k) + \tfrac{1}{2} s_k^T \nabla_{xx} f(\xi_k) s_k$
for some $\xi_k \in [x_k, x_k + s_k]$. Thus
$|f(x_k + s_k) - m_k(s_k)| = \tfrac{1}{2} |s_k^T H(\xi_k) s_k - s_k^T B_k s_k| \leq \tfrac{1}{2} |s_k^T H(\xi_k) s_k| + \tfrac{1}{2} |s_k^T B_k s_k| \leq \tfrac{1}{2} (\kappa_h + \kappa_b) \|s_k\|^2 \leq \kappa_d \Delta_k^2$
using the triangle and Cauchy-Schwarz inequalities.

ULTIMATE PROGRESS AT NON-OPTIMAL POINTS

Lemma 3.4. Suppose that $f \in C^2$, that the true and model Hessians satisfy the bounds $\|H_k\| \leq \kappa_h$ and $\|B_k\| \leq \kappa_b$ for all $k$ and some $\kappa_h \geq 1$ and $\kappa_b \geq 0$, and that $\kappa_d = \tfrac{1}{2}(\kappa_h + \kappa_b)$. Suppose furthermore that $g_k \neq 0$ and that
$\Delta_k \leq \|g_k\| \min\left[ \dfrac{1}{\kappa_h + \kappa_b}, \dfrac{1 - \eta_v}{2 \kappa_d} \right].$
Then iteration $k$ is very successful and $\Delta_{k+1} \geq \Delta_k$.

PROOF OF LEMMA 3.4

By definition, $1 + \|B_k\| \leq \kappa_h + \kappa_b$, so the first bound on $\Delta_k$ gives
$\Delta_k \leq \dfrac{\|g_k\|}{\kappa_h + \kappa_b} \leq \dfrac{\|g_k\|}{1 + \|B_k\|},$
and Corollary 3.2 then implies
$f_k - m_k(s_k) \geq \tfrac{1}{2} \|g_k\| \min\left[ \dfrac{\|g_k\|}{1 + \|B_k\|}, \Delta_k \right] = \tfrac{1}{2} \|g_k\| \Delta_k.$
Lemma 3.3 and the second bound on $\Delta_k$ give
$|\rho_k - 1| = \left| \dfrac{f(x_k + s_k) - m_k(s_k)}{f_k - m_k(s_k)} \right| \leq \dfrac{\kappa_d \Delta_k^2}{\tfrac{1}{2} \|g_k\| \Delta_k} = \dfrac{2 \kappa_d \Delta_k}{\|g_k\|} \leq 1 - \eta_v.$
Therefore $\rho_k \geq \eta_v$ ⟹ the iteration is very successful, and hence $\Delta_{k+1} \geq \Delta_k$.

RADIUS WON'T SHRINK TO ZERO AT NON-OPTIMAL POINTS

Lemma 3.5. Suppose that $f \in C^2$, that the true and model Hessians satisfy the bounds $\|H_k\| \leq \kappa_h$ and $\|B_k\| \leq \kappa_b$ for all $k$ and some $\kappa_h \geq 1$ and $\kappa_b \geq 0$, and that $\kappa_d = \tfrac{1}{2}(\kappa_h + \kappa_b)$. Suppose furthermore that there exists a constant $\epsilon > 0$ such that $\|g_k\| \geq \epsilon$ for all $k$. Then
$\Delta_k \geq \kappa_\epsilon \stackrel{\rm def}{=} \epsilon \gamma_d \min\left[ \dfrac{1}{\kappa_h + \kappa_b}, \dfrac{1 - \eta_v}{2 \kappa_d} \right]$
for all $k$.

PROOF OF LEMMA 3.5

Suppose otherwise that iteration $k$ is the first for which $\Delta_{k+1} < \kappa_\epsilon$. Then $\Delta_k > \Delta_{k+1}$ ⟹ iteration $k$ was unsuccessful ⟹ $\gamma_d \Delta_k \leq \Delta_{k+1}$. Hence
$\Delta_k \leq \dfrac{\Delta_{k+1}}{\gamma_d} < \epsilon \min\left[ \dfrac{1}{\kappa_h + \kappa_b}, \dfrac{1 - \eta_v}{2 \kappa_d} \right] \leq \|g_k\| \min\left[ \dfrac{1}{\kappa_h + \kappa_b}, \dfrac{1 - \eta_v}{2 \kappa_d} \right].$
But this contradicts the assertion of Lemma 3.4 that iteration $k$ must be very successful.

POSSIBLE FINITE TERMINATION

Lemma 3.6. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Suppose furthermore that there are only finitely many successful iterations. Then $x_k = x_*$ for all sufficiently large $k$ and $g(x_*) = 0$.

PROOF OF LEMMA 3.6

$x_{k_0 + j} = x_{k_0 + 1} = x_*$ for all $j > 0$, where $k_0$ is the index of the last successful iterate. All iterations are unsuccessful for sufficiently large $k$ ⟹ $\{\Delta_k\} \to 0$. Lemma 3.4 then implies that if $\|g_{k_0 + 1}\| > 0$ there must be a successful iteration of index larger than $k_0$, which is impossible ⟹ $g_{k_0 + 1} = 0$.

GLOBAL CONVERGENCE OF ONE SEQUENCE

Theorem 3.7. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Then either
$g_l = 0$ for some $l \geq 0$, or $\lim_{k \to \infty} f_k = -\infty$, or $\liminf_{k \to \infty} \|g_k\| = 0.$

PROOF OF THEOREM 3.7

Let $\mathcal{S}$ be the index set of successful iterations. Lemma 3.6 ⟹ Theorem 3.7 is true when $\mathcal{S}$ is finite. So consider $|\mathcal{S}| = \infty$, and suppose that $f_k$ is bounded below and that
$\|g_k\| \geq \epsilon \qquad (6)$
for some $\epsilon > 0$ and all $k$, and consider some $k \in \mathcal{S}$. Corollary 3.2, Lemma 3.5, and assumption (6) ⟹
$f_k - f_{k+1} \geq \eta_s [f_k - m_k(s_k)] \geq \delta_\epsilon \stackrel{\rm def}{=} \tfrac{1}{2} \eta_s \epsilon \min\left[ \dfrac{\epsilon}{1 + \kappa_b}, \kappa_\epsilon \right]$
⟹ $f_0 - f_{k+1} = \sum_{j = 0,\; j \in \mathcal{S}}^{k} [f_j - f_{j+1}] \geq \sigma_k \delta_\epsilon,$
where $\sigma_k$ is the number of successful iterations up to iteration $k$. But $\lim_{k \to \infty} \sigma_k = +\infty$ ⟹ $f_k$ is unbounded below, a contradiction ⟹ a subsequence of the $\|g_k\| \to 0$.

GLOBAL CONVERGENCE

Theorem 3.8. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Then either
$g_l = 0$ for some $l \geq 0$, or $\lim_{k \to \infty} f_k = -\infty$, or $\lim_{k \to \infty} \|g_k\| = 0.$

PROOF OF THEOREM 3.8

Suppose otherwise that $f_k$ is bounded from below, and that there is a subsequence $\{t_i\} \subseteq \mathcal{S}$ such that
$\|g_{t_i}\| \geq 2\epsilon > 0 \qquad (7)$
for some $\epsilon > 0$ and for all $i$. Theorem 3.7 ⟹ there is another subsequence $\{l_i\} \subseteq \mathcal{S}$ such that
$\|g_k\| \geq \epsilon \text{ for } t_i \leq k < l_i \text{ and } \|g_{l_i}\| < \epsilon. \qquad (8)$
Now restrict attention to the indices in
$\mathcal{K} \stackrel{\rm def}{=} \{k \in \mathcal{S} : t_i \leq k < l_i \text{ for some } i\}.$

[Figure 3.1: The subsequences $\{t_i\}$, $\{l_i\}$ and $\mathcal{K} \subseteq \mathcal{S}$ used in the proof of Theorem 3.8, plotted against $\|g_k\|$ with the thresholds $\epsilon$ and $2\epsilon$ marked.]

As in the proof of Theorem 3.7, (8) ⟹
$f_k - f_{k+1} \geq \eta_s [f_k - m_k(s_k)] \geq \tfrac{1}{2} \eta_s \epsilon \min\left[ \dfrac{\epsilon}{1 + \kappa_b}, \Delta_k \right] \qquad (9)$
for all $k \in \mathcal{K}$. The left-hand side of (9) tends to 0 as $k \to \infty$ ⟹ $\lim_{k \to \infty,\, k \in \mathcal{K}} \Delta_k = 0$ ⟹
$\Delta_k \leq \dfrac{2}{\epsilon \eta_s} [f_k - f_{k+1}]$
for $k \in \mathcal{K}$ sufficiently large. Hence
$\|x_{t_i} - x_{l_i}\| \leq \sum_{j = t_i,\; j \in \mathcal{K}}^{l_i - 1} \|x_j - x_{j+1}\| \leq \sum_{j = t_i,\; j \in \mathcal{K}}^{l_i - 1} \Delta_j \leq \dfrac{2}{\epsilon \eta_s} [f_{t_i} - f_{l_i}] \qquad (10)$
for $i$ sufficiently large. But the right-hand side of (10) tends to 0 ⟹ $\|x_{t_i} - x_{l_i}\| \to 0$ as $i$ tends to infinity, and continuity then gives $\|g_{t_i} - g_{l_i}\| \to 0$.

This is impossible, since $\|g_{t_i} - g_{l_i}\| \geq \epsilon$ by the definition of $\{t_i\}$ and $\{l_i\}$ ⟹ no subsequence satisfying (7) can exist.

II: SOLVING THE TRUST-REGION SUBPROBLEM

(Approximately)
minimize $q(s) \equiv s^T g + \tfrac{1}{2} s^T B s$ over $s \in \mathbb{R}^n$ subject to $\|s\| \leq \Delta$.
AIM: find $s_*$ so that $q(s_*) \leq q(s^C)$ and $\|s_*\| \leq \Delta$.
Might solve
- exactly ⟹ Newton-like method
- approximately ⟹ steepest descent / conjugate gradients

THE $\ell_2$-NORM TRUST-REGION SUBPROBLEM

minimize $q(s) \equiv s^T g + \tfrac{1}{2} s^T B s$ over $s \in \mathbb{R}^n$ subject to $\|s\|_2 \leq \Delta$.

Solution characterisation result:

Theorem 3.9. Any global minimizer $s_*$ of $q(s)$ subject to $\|s\|_2 \leq \Delta$ satisfies the equation
$(B + \lambda_* I) s_* = -g,$
where $B + \lambda_* I$ is positive semi-definite, $\lambda_* \geq 0$ and $\lambda_* (\|s_*\|_2 - \Delta) = 0$. If $B + \lambda_* I$ is positive definite, $s_*$ is unique.

PROOF OF THEOREM 3.9

The problem is equivalent to minimizing $q(s)$ subject to $\tfrac{1}{2} \Delta^2 - \tfrac{1}{2} s^T s \geq 0$. Theorem 1.9 ⟹
$g + B s_* = -\lambda_* s_* \qquad (11)$
for some Lagrange multiplier $\lambda_* \geq 0$ for which either $\lambda_* = 0$ or $\|s_*\|_2 = \Delta$ (or both). It remains to show that $B + \lambda_* I$ is positive semi-definite.
If $s_*$ lies in the interior of the trust region, $\lambda_* = 0$, and Theorem 1.10 ⟹ $B + \lambda_* I = B$ is positive semi-definite.
If $\|s_*\|_2 = \Delta$ and $\lambda_* = 0$, Theorem 1.10 ⟹ $v^T B v \geq 0$ for all $v \in \mathcal{N}_+ = \{v : s_*^T v \geq 0\}$. If $v \notin \mathcal{N}_+$, then $-v \in \mathcal{N}_+$ ⟹ $v^T B v \geq 0$ for all $v$.
The only remaining case is $\|s_*\|_2 = \Delta$ and $\lambda_* > 0$. Theorem 1.10 ⟹ $v^T (B + \lambda_* I) v \geq 0$ for all $v \in \mathcal{N}_+ = \{v : s_*^T v = 0\}$ ⟹ it remains to consider $v^T B v$ when $s_*^T v \neq 0$.

[Figure 3.2: Construction of the missing directions of positive curvature: a point $s$ on the trust-region boundary, the step $w = s - s_*$, and the cone $\mathcal{N}_+$.]

Let $s$ be any point on the boundary $\partial \mathcal{R}$ of the trust region $\mathcal{R}$, and let $w = s - s_*$. Then
$-w^T s_* = -(s - s_*)^T s_* = \tfrac{1}{2} (s - s_*)^T (s - s_*) = \tfrac{1}{2} w^T w \qquad (12)$
since $\|s\|_2 = \Delta = \|s_*\|_2$. (11) + (12) ⟹
$q(s) - q(s_*) = w^T (g + B s_*) + \tfrac{1}{2} w^T B w = -\lambda_* w^T s_* + \tfrac{1}{2} w^T B w = \tfrac{1}{2} w^T (B + \lambda_* I) w,$
and hence
$w^T (B + \lambda_* I) w \geq 0 \qquad (13)$
since $s_*$ is a global minimizer. But
$s = s_* - 2 \dfrac{s_*^T v}{v^T v} v \in \partial \mathcal{R}$ ⟹ (for this $s$) $w \parallel v$ ⟹ $v^T (B + \lambda_* I) v \geq 0.$
When $B + \lambda_* I$ is positive definite, $s_* = -(B + \lambda_* I)^{-1} g$. If $s_* \in \partial \mathcal{R}$ and $s \in \mathcal{R}$, (12) and (13) become $-w^T s_* \geq \tfrac{1}{2} w^T w$ and $q(s) \geq q(s_*) + \tfrac{1}{2} w^T (B + \lambda_* I) w$ respectively. Hence $q(s) > q(s_*)$ for any $s \neq s_*$. If $s_*$ is interior, $\lambda_* = 0$, $B$ is positive definite, and thus $s_*$ is the unique unconstrained minimizer of $q(s)$.

ALGORITHMS FOR THE $\ell_2$-NORM SUBPROBLEM

Two cases:
- $B$ positive semi-definite and $Bs = -g$ satisfies $\|s\|_2 \leq \Delta$ ⟹ $s_* = s$
- $B$ indefinite, or $Bs = -g$ satisfies $\|s\|_2 > \Delta$
In the second case
$(B + \lambda_* I) s_* = -g$ and $s_*^T s_* = \Delta^2,$
a nonlinear (quadratic) system in $s$ and $\lambda$ ⟹ concentrate on this.

EQUALITY CONSTRAINED $\ell_2$-NORM SUBPROBLEM

Suppose $B$ has the spectral decomposition
$B = U^T \Lambda U$
($U$ eigenvectors, $\Lambda$ diagonal eigenvalues: $\lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$).
Require $B + \lambda I$ positive semi-definite ⟹ $\lambda \geq -\lambda_1$.
Define $s(\lambda) = -(B + \lambda I)^{-1} g$ and require
$\psi(\lambda) \stackrel{\rm def}{=} \|s(\lambda)\|_2^2 = \Delta^2.$
Note that
$\psi(\lambda) = \|U^T (\Lambda + \lambda I)^{-1} U g\|_2^2 = \sum_{i=1}^{n} \dfrac{\gamma_i^2}{(\lambda_i + \lambda)^2}$, where $\gamma_i = e_i^T U g$.
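For small dense problems, $\psi(\lambda)$ can be evaluated directly from the spectral decomposition, as in the following NumPy sketch; the data at the end are made up for illustration and are not the example used on the next slides.

```python
import numpy as np

def psi(lam, B, g):
    """psi(lambda) = ||s(lambda)||_2^2 with s(lambda) = -(B + lambda I)^{-1} g,
    evaluated via the eigendecomposition of B (valid for lam > -lambda_1)."""
    eigvals, V = np.linalg.eigh(B)   # B = V diag(eigvals) V^T
    gamma = V.T @ g                  # components of g in the eigenvector basis
    return np.sum(gamma**2 / (eigvals + lam)**2)

# illustrative data (hypothetical, not the slides' example)
B = np.diag([-2.0, 1.0, 3.0])
g = np.array([1.0, 1.0, 1.0])
print(psi(2.5, B, g))                # requires lam > -lambda_1 = 2
```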

CONVEX EXAMPLE

[Plot: $\psi(\lambda)$ against $\lambda$ for a convex example with a 3×3 diagonal $B$ and $\Delta^2 = 1.5$, together with the solution curve as $\Delta$ varies.]

NONCONVEX EXAMPLE

[Plot: $\psi(\lambda)$ against $\lambda$ for a nonconvex example with a 3×3 diagonal $B$; the pole at minus the leftmost eigenvalue is marked.]

THE HARD CASE

[Plot: $\psi(\lambda)$ against $\lambda$ for the hard case (3×3 diagonal $B$, $\Delta^2 = 0.0903$); minus the leftmost eigenvalue is marked.]

SUMMARY

For indefinite $B$:
- the hard case occurs when $g$ is orthogonal to the eigenvector $u_1$ of the most negative eigenvalue $\lambda_1$
- it is OK if the radius $\Delta$ is small enough
- there is no obvious solution to the equations... but the solution is actually of the form $s_{\lim} + \sigma u_1$, where
$s_{\lim} = \lim_{\lambda \to -\lambda_1^+} s(\lambda)$
and $\sigma$ is chosen so that $\|s_{\lim} + \sigma u_1\|_2 = \Delta$.
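The hard-case construction can be checked numerically. The sketch below uses hypothetical data, a diagonal $B$ with $\lambda_1 = -2$ and a $g$ with no $u_1$ component (not the example plotted above), and verifies that $s_{\lim} + \sigma u_1$ satisfies the optimality conditions of Theorem 3.9 with $\lambda_* = -\lambda_1$.

```python
import numpy as np

# hypothetical hard-case data: g is orthogonal to u_1, the eigenvector of lambda_1 = -2
B = np.diag([-2.0, 1.0, 3.0])
g = np.array([0.0, 1.0, 1.0])
Delta = 1.0

eigvals, V = np.linalg.eigh(B)
lam = -eigvals[0]                    # lambda_* = -lambda_1 = 2
u1 = V[:, 0]

# s_lim = lim_{lambda -> -lambda_1^+} s(lambda): drop the direction with zero gamma_1
gamma = V.T @ g
coeff = np.zeros_like(gamma)
mask = np.abs(eigvals + lam) > 1e-12
coeff[mask] = -gamma[mask] / (eigvals[mask] + lam)
s_lim = V @ coeff

# choose sigma >= 0 so that ||s_lim + sigma u_1||_2 = Delta
sigma = np.sqrt(Delta**2 - s_lim @ s_lim)
s_star = s_lim + sigma * u1

print(np.linalg.norm(s_star))                            # equals Delta
print(np.allclose((B + lam * np.eye(3)) @ s_star, -g))   # (B + lambda I) s = -g holds
```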

HOW TO SOLVE $\|s(\lambda)\|_2 = \Delta$

DON'T!! Solve instead the secular equation
$\phi(\lambda) \stackrel{\rm def}{=} \dfrac{1}{\|s(\lambda)\|_2} - \dfrac{1}{\Delta} = 0$
- it has no poles
- it is smallest at the eigenvalues (except in the hard case!)
- it is an analytic function ⟹ ideal for Newton's method
- globally convergent (ultimately at a quadratic rate, except in the hard case)
- need to safeguard to protect Newton from the hard and interior-solution cases

THE SECULAR EQUATION

[Plots: $\phi(\lambda)$ for a small two-variable quadratic example, minimize $q(s_1, s_2)$ subject to $\|s\|_2 \leq \Delta$, shown for three values of the radius.]

NEWTON'S METHOD FOR THE SECULAR EQUATION

The Newton correction at $\lambda$ is $-\phi(\lambda)/\phi'(\lambda)$. Differentiating
$\phi(\lambda) = \dfrac{1}{\|s(\lambda)\|_2} - \dfrac{1}{\Delta} = \dfrac{1}{(s^T(\lambda) s(\lambda))^{1/2}} - \dfrac{1}{\Delta}$
gives
$\phi'(\lambda) = -\dfrac{s^T(\lambda) \nabla_\lambda s(\lambda)}{(s^T(\lambda) s(\lambda))^{3/2}} = -\dfrac{s^T(\lambda) \nabla_\lambda s(\lambda)}{\|s(\lambda)\|_2^3}.$
Differentiating the defining equation
$(B + \lambda I) s(\lambda) = -g$ ⟹ $(B + \lambda I) \nabla_\lambda s(\lambda) + s(\lambda) = 0.$
Notice that, rather than $\nabla_\lambda s(\lambda)$, merely
$s^T(\lambda) \nabla_\lambda s(\lambda) = -s^T(\lambda) (B + \lambda I)^{-1} s(\lambda)$
is required for $\phi'(\lambda)$. Given the factorization $B + \lambda I = L(\lambda) L^T(\lambda)$,
$s^T(\lambda) (B + \lambda I)^{-1} s(\lambda) = s^T(\lambda) L^{-T}(\lambda) L^{-1}(\lambda) s(\lambda) = (L^{-1}(\lambda) s(\lambda))^T (L^{-1}(\lambda) s(\lambda)) = \|w(\lambda)\|_2^2,$
where $L(\lambda) w(\lambda) = s(\lambda)$.

NEWTON'S METHOD & THE SECULAR EQUATION

Let $\lambda > -\lambda_1$ and $\Delta > 0$ be given. Until convergence do:
- Factorize $B + \lambda I = L L^T$
- Solve $L L^T s = -g$
- Solve $L w = s$
- Replace $\lambda$ by $\lambda + \left( \dfrac{\|s\|_2^2}{\|w\|_2^2} \right) \left( \dfrac{\|s\|_2 - \Delta}{\Delta} \right)$
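A minimal NumPy/SciPy sketch of this iteration follows. It assumes the given $\lambda$ satisfies $\lambda > -\lambda_1$ and that a boundary solution exists; the safeguards mentioned above for the hard and interior-solution cases are deliberately omitted, and the tolerance is an arbitrary choice.

```python
import numpy as np
from scipy.linalg import cho_solve, solve_triangular

def secular_newton(B, g, Delta, lam, tol=1e-10, max_iter=50):
    """Newton's method on the secular equation
    phi(lambda) = 1/||s(lambda)||_2 - 1/Delta = 0 (no safeguards)."""
    I = np.eye(B.shape[0])
    for _ in range(max_iter):
        L = np.linalg.cholesky(B + lam * I)       # factorize B + lambda I = L L^T
        s = cho_solve((L, True), -g)              # solve L L^T s = -g
        snorm = np.linalg.norm(s)
        if abs(snorm - Delta) <= tol * Delta:     # ||s(lambda)||_2 = Delta to tolerance
            break
        w = solve_triangular(L, s, lower=True)    # solve L w = s
        lam += (snorm**2 / (w @ w)) * ((snorm - Delta) / Delta)
    return s, lam
```

For example, with a small positive definite $B$ and a radius smaller than the norm of the unconstrained minimizer, starting from $\lambda = 0$ the iteration increases $\lambda$ towards $\lambda_*$.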

SOLVING THE LARGE-SCALE PROBLEM

- when $n$ is large, factorization may be impossible
- may instead try to use an iterative method to approximate the solution
- steepest descent leads to the Cauchy point
- obvious generalization: conjugate gradients...
- but what about the trust region?
- what about negative curvature?

CONJUGATE GRADIENTS TO MINIMIZE $q(s)$

Given $s^0 = 0$, set $g^0 = g$, $d^0 = -g$ and $i = 0$.
Until $\|g^i\|$ is small or breakdown occurs, iterate
$\alpha^i = \|g^i\|_2^2 / d^{iT} B d^i$
$s^{i+1} = s^i + \alpha^i d^i$
$g^{i+1} = g^i + \alpha^i B d^i$
$\beta^i = \|g^{i+1}\|_2^2 / \|g^i\|_2^2$
$d^{i+1} = -g^{i+1} + \beta^i d^i$
and increase $i$ by 1.
Important features (see the sketch below):
- $g^j = B s^j + g$ for all $j = 0, \ldots, i$
- $d^{jT} g^{i+1} = 0$ for all $j = 0, \ldots, i$
- $g^{jT} g^{i+1} = 0$ for all $j = 0, \ldots, i$
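A direct transcription of these recurrences into NumPy, as a sketch (the absolute stopping tolerance is an arbitrary choice, and breakdown is simply detected and reported by stopping):

```python
import numpy as np

def cg_quadratic(B, g, tol=1e-10, max_iter=None):
    """Conjugate gradients applied to q(s) = s'g + 0.5 s'Bs starting from s^0 = 0.
    Stops when the gradient g^i = B s^i + g is small, or on breakdown
    (non-positive curvature along d^i), which the truncated variant below handles."""
    n = g.size
    s = np.zeros(n)
    r = g.copy()             # r plays the role of g^i = B s^i + g
    d = -g.copy()
    for _ in range(max_iter or n):
        curv = d @ B @ d
        if curv <= 0.0:      # breakdown: q is unbounded below along d^i
            break
        alpha = (r @ r) / curv
        s = s + alpha * d
        r_new = r + alpha * (B @ d)
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
        if np.linalg.norm(r) <= tol:
            break
    return s
```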

CRUCIAL PROPERTY OF CONJUGATE GRADIENTS

Theorem 3.10. Suppose that the conjugate gradient method is applied to minimize $q(s)$ starting from $s^0 = 0$, and that $d^{iT} B d^i > 0$ for $0 \leq i \leq k$. Then the iterates $s^j$ satisfy the inequalities
$\|s^j\|_2 < \|s^{j+1}\|_2$ for $0 \leq j \leq k$.

PROOF OF THEOREM 3.10

First show that
$d^{iT} d^j = \dfrac{\|g^i\|_2^2}{\|g^j\|_2^2} \|d^j\|_2^2 > 0 \qquad (14)$
for all $0 \leq j \leq i \leq k$. For any $i$, (14) is trivially true for $j = i$. Suppose it is also true for all $i \leq l$. Then the update for $d^{l+1}$ gives
$d^{l+1} = -g^{l+1} + \dfrac{\|g^{l+1}\|_2^2}{\|g^l\|_2^2} d^l.$
Forming the inner product with $d^j$, and using the fact that $d^{jT} g^{l+1} = 0$ for all $j = 0, \ldots, l$, together with (14) with $i = l$, reveals
$d^{(l+1)T} d^j = -g^{(l+1)T} d^j + \dfrac{\|g^{l+1}\|_2^2}{\|g^l\|_2^2} d^{lT} d^j = \dfrac{\|g^{l+1}\|_2^2}{\|g^l\|_2^2} \cdot \dfrac{\|g^l\|_2^2}{\|g^j\|_2^2} \|d^j\|_2^2 = \dfrac{\|g^{l+1}\|_2^2}{\|g^j\|_2^2} \|d^j\|_2^2 > 0.$
Thus (14) is true for $i = l + 1$, and hence for all $0 \leq j \leq i \leq k$.

Now, from the algorithm,
$s^i = s^0 + \sum_{j=0}^{i-1} \alpha^j d^j = \sum_{j=0}^{i-1} \alpha^j d^j,$
since, by assumption, $s^0 = 0$. Hence
$s^{iT} d^i = \sum_{j=0}^{i-1} \alpha^j d^{jT} d^i > 0, \qquad (15)$
as each $\alpha^j > 0$, which follows from the definition of $\alpha^j$ since $d^{jT} B d^j > 0$, and from relationship (14). Hence
$\|s^{i+1}\|_2^2 = s^{(i+1)T} s^{i+1} = (s^i + \alpha^i d^i)^T (s^i + \alpha^i d^i) = s^{iT} s^i + 2 \alpha^i s^{iT} d^i + (\alpha^i)^2 d^{iT} d^i > s^{iT} s^i = \|s^i\|_2^2$
follows directly from (15) and $\alpha^i > 0$, which is the required result.

TRUNCATED CONJUGATE GRADIENTS

Apply the conjugate gradient method, but terminate at iteration $i$ if
1. $d^{iT} B d^i \leq 0$ ⟹ the problem is unbounded along $d^i$
2. $\|s^i + \alpha^i d^i\|_2 > \Delta$ ⟹ the solution lies on the trust-region boundary
In both cases, stop with $s_* = s^i + \alpha^B d^i$, where $\alpha^B$ is chosen as the positive root of $\|s^i + \alpha^B d^i\|_2 = \Delta$.
Crucially,
$q(s_*) \leq q(s^C)$ and $\|s_*\|_2 \leq \Delta$
⟹ the TR algorithm converges to a first-order critical point (see the sketch below).
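The following sketch grafts the two extra exits onto the CG loop above, in the spirit of the Steihaug-Toint truncated method (illustrative only, not a production implementation). Stopping at the first boundary crossing is legitimate precisely because, by Theorem 3.10, the norms $\|s^i\|_2$ increase monotonically while the iterates remain interior.

```python
import numpy as np

def truncated_cg(B, g, Delta, tol=1e-10, max_iter=None):
    """Truncated conjugate gradients (sketch) for
    min q(s) = s'g + 0.5 s'Bs subject to ||s||_2 <= Delta, starting from s^0 = 0."""
    n = g.size
    s = np.zeros(n)
    r = g.copy()
    d = -g.copy()
    for _ in range(max_iter or n):
        if np.linalg.norm(r) <= tol:
            return s                          # interior (unconstrained) solution
        curv = d @ B @ d
        if curv > 0.0:
            alpha = (r @ r) / curv
            if np.linalg.norm(s + alpha * d) <= Delta:
                s = s + alpha * d             # ordinary CG step, still inside
                r_new = r + alpha * (B @ d)
                d = -r_new + ((r_new @ r_new) / (r @ r)) * d
                r = r_new
                continue
        # negative curvature, or the CG step would leave the trust region:
        # take the positive root alpha_B of ||s + alpha d||_2 = Delta and stop
        sd, dd, ss = s @ d, d @ d, s @ s
        alpha_B = (-sd + np.sqrt(sd**2 + dd * (Delta**2 - ss))) / dd
        return s + alpha_B * d
    return s
```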

HOW GOOD IS TRUNCATED C.G.?

In the convex case... very good:

Theorem 3.11. Suppose that the truncated conjugate gradient method is applied to minimize $q(s)$ and that $B$ is positive definite. Then the computed and actual solutions to the problem, $s_*$ and $s^M$, satisfy the bound
$q(s_*) \leq \tfrac{1}{2} q(s^M).$

In the non-convex case... maybe poor:
e.g., if $g = 0$ and $B$ is indefinite ⟹ $q(s_*) = 0$.

WHAT CAN WE DO IN THE NON-CONVEX CASE?

Solve the problem over a subspace:
- instead of the $B$-conjugate subspace basis $D^i$ from CG, use the equivalent Lanczos orthogonal basis $Q^i$ (Gram-Schmidt applied to the CG (Krylov) basis $D^i$)
- subspace $\mathcal{Q}^i = \{s : s = Q^i s_q \text{ for some } s_q \in \mathbb{R}^i\}$
- $Q^i$ is such that $Q^{iT} Q^i = I$ and $Q^{iT} B Q^i = T^i$, where $T^i$ is tridiagonal, and $Q^{iT} g = \|g\|_2 e_1$
- $Q^i$ is trivial to generate from the CG basis $D^i$

GENERALIZED LANCZOS TRUST-REGION METHOD

⟹ $s^i = Q^i s_q^i$, where
$s^i = \arg\min_{s \in \mathcal{Q}^i} q(s)$ subject to $\|s\|_2 \leq \Delta$
$s_q^i = \arg\min_{s_q \in \mathbb{R}^i} \|g\|_2 e_1^T s_q + \tfrac{1}{2} s_q^T T^i s_q$ subject to $\|s_q\|_2 \leq \Delta$
- advantage: $T^i$ has very sparse factors ⟹ can solve the problem using the earlier secular-equation approach
- can exploit all the structure here ⟹ use the solution of one problem to initialize the next
- until the trust-region boundary is reached, it is conjugate gradients ⟹ switch when we get there
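To illustrate the properties claimed for $Q^i$ and $T^i$, here is a minimal sketch of the Lanczos process (no reorthogonalization and no trust-region handling; the function name and the random test data are illustrative assumptions):

```python
import numpy as np

def lanczos_basis(B, g, m):
    """Build an orthonormal Q (n x m) and tridiagonal T = Q^T B Q with
    Q^T g = ||g||_2 e_1, using the Lanczos three-term recurrence."""
    n = g.size
    Q = np.zeros((n, m))
    alpha = np.zeros(m)      # diagonal of T
    beta = np.zeros(m)       # off-diagonals of T
    q, q_prev, b_prev = g / np.linalg.norm(g), np.zeros(n), 0.0
    for j in range(m):
        Q[:, j] = q
        w = B @ q - b_prev * q_prev
        alpha[j] = q @ w
        w -= alpha[j] * q
        b = np.linalg.norm(w)
        if b < 1e-12 or j == m - 1:          # breakdown, or last requested column
            k = j + 1
            T = np.diag(alpha[:k]) + np.diag(beta[:k-1], 1) + np.diag(beta[:k-1], -1)
            return Q[:, :k], T
        beta[j] = b
        q_prev, q, b_prev = q, w / b, b

# hypothetical check of the stated identities
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)); B = (A + A.T) / 2
g = rng.standard_normal(6)
Q, T = lanczos_basis(B, g, 4)
print(np.allclose(Q.T @ Q, np.eye(Q.shape[1])),
      np.allclose(Q.T @ B @ Q, T),
      np.allclose(Q.T @ g, np.linalg.norm(g) * np.eye(Q.shape[1])[:, 0]))
```

The tridiagonal subproblem in $s_q$ can then be passed to the secular-equation solver from the earlier slides, which is the essence of the generalized Lanczos trust-region approach.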