Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Size: px

Start display at page:

Download "Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)"

Laura Davidson
5 years ago
Views:

1 Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective function f : IR n IR assume that f C (sometimes C 2 ) and Lipschitz often in practice this assumption violated, but not necessary

2 LINESEARCH VS TRUST-REGION METHODS Linesearch methods pick descent direction p k pick stepsize α k to reduce f(x k + αp k ) x k+ = x k + α k p k Trust-region methods pick step s k to reduce model of f(x k + s) accept x k+ = x k +s k if decrease in model inherited by f(x k +s k ) otherwise set x k+ = x k, refine model TRUST-REGION MODEL PROBLEM Model f(x k + s) by: linear model m L k (s) = f k + s T g k quadratic model symmetric B k Major difficulties: m Q k (s) = f k + s T g k + 2s T B k s models may not resemble f(x k + s) if s is large models may be unbounded from below linear model - always unless g k = 0 quadratic model - always if B k is indefinite, possibly if B k is only positive semi-definite

3 THE TRUST REGION Prevent model m k (s) from unboundedness by imposing a trust-region constraint s k for some suitable scalar radius k > 0 = trust-region subproblem approx minimize s IR n in theory does not depend on norm in practice it might! m k (s) subject to s k OUR MODEL For simplicity, concentrate on the second-order (Newton-like) model m k (s) = m Q k (s) = f k + s T g k + 2s T B k s and the l 2 -trust region norm = 2 Note: B k = H k is allowed analysis for other trust-region norms simply adds extra constants in following results

4 BASIC TRUST-REGION METHOD Given k = 0, 0 > 0 and x 0, until convergence do: Build the second-order model m(s) of f(x k + s). Solve the trust-region subproblem to find s k for which m(s k ) < f k and s k k, and define ρ k = f k f(x k + s k ). f k m k (s k ) If ρ k η v [very successful] 0 < η v < set x k+ = x k + s k and k+ = γ i k γ i Otherwise if ρ k η s then [successful] 0 < η s η v < set x k+ = x k + s k and k+ = k Otherwise [unsuccessful] set x k+ = x k and k+ = γ d k 0 < γ d < Increase k by SOLVE THE TRUST REGION SUBPROBLEM? At the very least aim to achieve as much reduction in the model as would an iteration of steepest descent Cauchy point: s C k = α C kg k where αk C = arg min m k ( αg k ) subject to α g k k α>0 = arg min m k ( αg k ) 0<α k / g k minimize quadratic on line segment = very easy! require that m k (s k ) m k (s C k) and s k k in practice, hope to do far better than this

5 ACHIEVABLE MODEL DECREASE Theorem 3.. If m k (s) is the second-order model and s C k is its Cauchy point within the trust-region s k, f k m k (s C k) 2 g k min g k + B k, k. PROOF OF THEOREM 3. m k ( αg k ) = f k α g k 2 + 2α 2 g T k B k g k. Result immediate if g k = 0. Otherwise, 3 possibilities (i) curvature g T k B k g k 0 = m k ( αg k ) unbounded from below as α increases = Cauchy point occurs on the trust-region boundary. (ii) curvature g T k B k g k > 0 & minimizer m k ( αg k ) occurs at or beyond the trust-region boundary = Cauchy point occurs on the trustregion boundary. (iii) the curvature gk T B k g k > 0 & minimizer m k ( αg k ), and hence Cauchy point, occurs before trust-region is reached. Consider each case in turn;

6 Case (i) gk T B k g k 0 & α 0 = m k ( αg k ) = f k α g k 2 + 2α 2 gk T B k g k f k α g k 2 () Cauchy point lies on boundary of the trust region = () + (2) = α C k = k g k. (2) f k m k (s C k) g k 2 k g k = g k k 2 g k k. Case (ii) = = α k (3) + (4) + (5) = def = arg min m k ( αg k ) f k α g k 2 + 2α 2 g T k B k g k (3) α k = g k 2 g T k B k g k α C k = k g k (4) α C kg T k B k g k g k 2. (5) f k m k (s C k) = α C k g k 2 2[α C k] 2 g T k B k g k 2α C k g k 2 = 2 g k 2 k g k = 2 g k k.

7 Case (iii) = where α C k = α k = g k 2 g T k B k g k f k m k (s C k) = αk g k 2 + 2(α k) 2 gk T B k g k = g k 4 g k 4 gk T 2 B k g k gk T B k g k g k 4 = 2 2 gk T B k g k g k 2 + B k, g T k B k g k g k 2 B k g k 2 ( + B k ) because of the Cauchy-Schwarz inequality. Corollary 3.2. If m k (s) is the second-order model, and s k is an improvement on the Cauchy point within the trust-region s k, f k m k (s k ) 2 g k min g k + B k, k.

8 DIFFERENCE BETWEEN MODEL AND FUNCTION Lemma 3.3. Suppose that f C 2, and that the true and model Hessians satisfy the bounds H(x) κ h for all x and B k κ b for all k and some κ h and κ b 0. Then where κ d = 2(κ h + κ b ), for all k. f(x k + s k ) m k (s k ) κ d 2 k, PROOF OF LEMMA 3.3 Mean value theorem = f(x k + s k ) = f(x k ) + s T k x f(x k ) + 2s T k xx f(ξ k )s k for some ξ k [x k, x k + s k ]. Thus f(x k + s k ) m k (s k ) = 2 s T k H(ξ k )s k s T k B k s k 2 s T k H(ξ k )s k + 2 s T k B k s k 2(κ h + κ b ) s k 2 κ d 2 k using the triangle and Cauchy-Schwarz inequalities.

9 ULTIMATE PROGRESS AT NON-OPTIMAL POINTS Lemma 3.4. Suppose that f C 2, that the true and model Hessians satisfy the bounds H k κ h and B k κ b for all k and some κ h and κ b 0, and that κ d = 2(κ h + κ b ). Suppose furthermore that g k 0 and that k g k min Then iteration k is very successful and, ( η v) κ h + κ b 2κ d k+ k.. PROOF OF LEMMA 3.4 By definition, + B k κ h + κ b + first bound on k = Corollary 3.2 = k g k κ h + κ b f k m k (s k ) 2 g k min + Lemma second bound on k = ρ k = f(x k + s k ) m k (s k ) f k m k (s k ) g k + B k. = ρ k η v = iteration is very successful. g k + B k, k = 2 g k k. 2 κ d 2 k g k k = 2 κ d k g k η v.

10 RADIUS WON T SHRINK TO ZERO AT NON-OPTIMAL POINTS Lemma 3.5. Suppose that f C 2, that the true and model Hessians satisfy the bounds H k κ h and B k κ b for all k and some κ h and κ b 0, and that κ d = 2(κ h + κ b ). Suppose furthermore that there exists a constant ɛ > 0 such that g k ɛ for all k. Then for all k. def k κ ɛ = ɛγ d min, ( η v) κ h + κ b 2κ d PROOF OF LEMMA 3.5 Suppose otherwise that iteration k is first for which k+ κ ɛ. k > k+ = iteration k unsuccessful = γ d k k+. Hence k ɛ min, ( η v) κ h + κ b 2κ d g k min, ( η v) κ h + κ b 2κ d But this contradicts assertion of Lemma 3.4 that iteration k must be very successful.

11 POSSIBLE FINITE TERMINATION Lemma 3.6. Suppose that f C 2, and that both the true and model Hessians remain bounded for all k. Suppose furthermore that there are only finitely many successful iterations. Then x k = x for all sufficiently large k and g(x ) = 0. PROOF OF LEMMA 3.6 x k0 +j = x k0 + = x for all j > 0, where k 0 is index of last successful iterate. All iterations are unsuccessful for sufficiently large k = { k } 0 + Lemma 3.4 then implies that if g k0 + > 0 there must be a successful iteration of index larger than k 0, which is impossible = g k0 + = 0.

12 GLOBAL CONVERGENCE OF ONE SEQUENCE Theorem 3.7. Suppose that f C 2, and that both the true and model Hessians remain bounded for all k. Then either g l = 0 for some l 0 or or lim f k = k lim inf k g k = 0. PROOF OF THEOREM 3.7 Let S be the index set of successful iterations. Lemma 3.6 = true Theorem 3.7 when S finite. So consider S =, and suppose f k bounded below and g k ɛ (6) for some ɛ > 0 and all k, and consider some k S. + Corollary 3.2, Lemma 3.5, and the assumption (6) = def ɛ f k f k+ η s [f k m k (s k )] δ ɛ = 2η s ɛ min, κ ɛ + κ b = f 0 f k+ = k [f j f j+ ] σ k δ ɛ, j=0 j S where σ k is the number of successful iterations up to iteration k. But lim k σ k = +. = f k unbounded below = a subsequence of the g k 0.

13 GLOBAL CONVERGENCE Theorem 3.8. Suppose that f C 2, and that both the true and model Hessians remain bounded for all k. Then either g l = 0 for some l 0 or or lim f k = k lim g k = 0. k PROOF OF THEOREM 3.8 Suppose otherwise that f k is bounded from below, and that there is a subsequence {t i } S, such that g ti 2ɛ > 0 (7) for some ɛ > 0 and for all i. Theorem 3.7 = {l i } S such that g k ɛ for t i k < l i and g li < ɛ. (8) Now restrict attention to indices in K def = {k S t i k < l i }.

14 g k 2ɛ ɛ S {t i} {l i} K k Figure 3.: The subsequences of the proof of Theorem 3.8 As in proof of Theorem 3.7, (8) = f k f k+ η s [f k m k (s k )] 2η s ɛ min for all k K = LHS of (9) 0 as k = lim k = 0 k k K = k 2 [f k f k+ ]. ɛη s for k K sufficiently large = x ti x li l i j=t i j K x j x j+ l i j=t i j K ɛ, k (9) + κ b j 2 ɛη s [f ti f li ]. (0) for i sufficiently large. But RHS of (0) 0 = x ti x li 0 as i tends to infinity + continuity = g ti g li 0.

15 Impossible as g ti g li ɛ by definition of {t i } and {l i } = no subsequence satisfying (7) can exist. II: SOLVING THE TRUST-REGION SUBPROBLEM (approximately) minimize s IR n q(s) s T g + 2s T Bs subject to s AIM: find s so that q(s ) q(s C ) and s Might solve exactly = Newton-like method approximately = steepest descent/conjugate gradients

16 THE l 2 -NORM TRUST-REGION SUBPROBLEM minimize s IR n q(s) s T g + 2s T Bs subject to s 2 Solution characterisation result: Theorem 3.9. Any global minimizer s of q(s) subject to s 2 satisfies the equation (B + λ I)s = g, where B+λ I is positive semi-definite, λ 0 and λ ( s 2 ) = 0. If B + λ I is positive definite, s is unique. PROOF OF THEOREM 3.9 Problem equivalent to minimizing q(s) subject to 2 2 2s T s 0. Theorem.9 = g + Bs = λ s () for some Lagrange multiplier λ 0 for which either λ = 0 or s 2 = (or both). It remains to show B + λ I is positive semi-definite. If s lies in the interior of the trust-region, λ = 0, and Theorem.0 = B + λ I = B is positive semi-definite. If s 2 = and λ = 0, Theorem.0 = v T Bv 0 for all v N + = {v s T v 0}. If v / N + = v N + = v T Bv 0 for all v. Only remaining case is where s 2 = and λ > 0. Theorem.0 = v T (B + λ I)v 0 for all v N + = {v s T v = 0} = remains to consider v T Bv when s T v 0.

17 s N + w s Figure 3.2: Construction of missing directions of positive curvature. Let s be any point on the boundary δr of the trust-region R, and let w = s s. Then w T s = (s s) T s = 2(s s) T (s s) = 2w T w (2) since s 2 = = s 2. () + (2) = q(s) q(s ) = w T (g + Bs ) + 2w T Bw = λ w T s + 2w T Bw = 2w T (B + λ I)w, = w T (B + λ I)w 0 since s is a global minimizer. But (3) s = s 2 st v v T v v δr = (for this s) w v = v T (B + λ I)v 0. When B + λ I is positive definite, s = (B + λ I) g. If s δr and s R, (2) and (3) become w T s 2w T w and q(s) q(s ) + 2w T (B + λ I)w respectively. Hence, q(s) > q(s ) for any s s. If s is interior, λ = 0, B is positive definite, and thus s is the unique unconstrained minimizer of q(s).

18 ALGORITHMS FOR THE l 2 -NORM SUBPROBLEM Two cases: B positive-semi definite and Bs = g satisfies s 2 = s = s B indefinite or Bs = g satisfies s 2 > In this case (B + λ I)s = g and s T s = 2 nonlinear (quadratic) system in s and λ concentrate on this EQUALITY CONSTRAINED l 2 -NORM SUBPROBLEM Suppose B has spectral decomposition B = U T ΛU U eigenvectors Λ diagonal eigenvalues: λ λ 2... λ n Require B + λi positive semi-definite = λ λ Define Require Note s(λ) = (B + λi) g ψ(λ) def = s(λ) 2 2 = 2 (γ i = e T i Ug) ψ(λ) = U T (Λ + λi) Ug 2 2 = n i= γ 2 i (λ i + λ) 2

19 CONVEX EXAMPLE ψ(λ) λ B = solution curve as varies 2 = g = NONCONVEX EXAMPLE ψ(λ) 2 B = minus leftmost eigenvalue g = λ

20 THE HARD CASE ψ(λ) 2 B = minus leftmost eigenvalue g = = λ SUMMARY For indefinite B, Hard case occurs when g orthogonal to eigenvector u for most negative eigenvalue λ OK if radius is radius small enough No obvious solution to equations... but solution is actually of the form where s lim = lim λ + λ s(λ) s lim + σu 2 = s lim + σu

21 HOW TO SOLVE s(λ) 2 = DON T!! Solve instead the secular equation no poles φ(λ) def = s(λ) 2 = 0 smallest at eigenvalues (except in hard case!) analytic function = ideal for Newton global convergent (ultimately quadratic rate except in hard case) need to safeguard to protect Newton from the hard & interior solution cases THE SECULAR EQUATION φ(λ) min 4s 2 + 4s s + s 2 subject to s λ λ λ

22 NEWTON S METHOD FOR SECULAR EQUATION Newton correction at λ is φ(λ)/φ (λ). Differentiating φ(λ) = s(λ) 2 = (s T (λ)s(λ)) 2 = φ (λ) = st (λ) λ s(λ) (s T (λ)s(λ)) 3 2 Differentiating the defining equation = st (λ) λ s(λ). s(λ) 3 2 (B + λi)s(λ) = g = (B + λi) λ s(λ) + s(λ) = 0. Notice that, rather than λ s(λ), merely s T (λ) λ s(λ) = s T (λ)(b + λi)(λ) s(λ) required for φ (λ). Given the factorization B + λi = L(λ)L T (λ) = s T (λ)(b + λi) s(λ) = s T (λ)l T (λ)l (λ)s(λ) = (L (λ)s(λ)) T (L (λ)s(λ)) = w(λ) 2 2 where L(λ)w(λ) = s(λ). NEWTON S METHOD & THE SECULAR EQUATION Let λ > λ and > 0 be given Until convergence do: Factorize B + λi = LL T Solve LL T s = g Solve Lw = s Replace λ by λ + s 2 s 2 2 w 2

23 SOLVING THE LARGE-SCALE PROBLEM when n is large, factorization may be impossible may instead try to use an iterative method to approximate Steepest descent leads to the Cauchy point obvious generalization: conjugate gradients... but what about the trust region? what about negative curvature? CONJUGATE GRADIENTS TO MINIMIZE q(s) Given s 0 = 0, set g 0 = g, d 0 = g and i = 0 Until g i small or breakdown, iterate α i = g i 2 2/d i T Bd i s i+ = s i + α i d i g i+ = g i + α i Bd i β i = g i+ 2 2/ g i 2 2 d i+ = g i+ + β i d i and increase i by Important features g j = Bs j + g for all j = 0,..., i d j T g i+ = 0 for all j = 0,..., i g j T g i+ = 0 for all j = 0,..., i

24 CRUCIAL PROPERTY OF CONJUGATE GRADIENTS Theorem 3.0. Suppose that the conjugate gradient method is applied to minimize q(s) starting from s 0 = 0, and that d i T Bd i > 0 for 0 i k. Then the iterates s j satisfy the inequalities for 0 j k. s j 2 < s j+ 2 PROOF OF THEOREM 3.0 First show that d i T d j = gi 2 2 d j 2 g j 2 2 > 0 (4) for all 0 j i k. For any i, (4) is trivially true for j = i. Suppose it is also true for all i l. Then, the update for d l+ gives d l+ = g l+ + gl+ 2 2 d l. g l 2 Forming the inner product with d j, and using the fact that d j T g l+ = 0 for all j = 0,..., l, and (4) when j = l, reveals d l+ T d j = g l+ T d j + gl+ 2 2 d l T d j g l 2 = gl+ 2 2 g l 2 2 d j 2 g l 2 g j 2 2 = gl+ 2 2 d j 2 g j 2 2 > 0. Thus (4) is true for i l +, and hence for all 0 j i k.

25 Now have from the algorithm that s i = s 0 + i j=0 αj d j = i as, by assumption, s 0 = 0. Hence s i T d i = i j=0 αj d j T d i = i j=0 αj d j j=0 αj d j T d i > 0 (5) as each α j > 0, which follows from the definition of α j, since d j T Hd j > 0, and from relationship (4). Hence s i+ 2 2 = s i+ T s i+ = ( s i + α i d i) T ( s i + α i d i) = s i T s i + 2α i s i T d i + α i 2 d i T d i > s i T s i = s i 2 2 follows directly from (5) and α i > 0 which is the required result. TRUNCATED CONJUGATE GRADIENTS Apply the conjugate gradient method, but terminate at iteration i if. d i T Bd i 0 = problem unbounded along d i 2. s i + α i d i 2 > = solution on trust-region boundary In both cases, stop with s = s i + α B d i, where α B root of s i + α B d i 2 = chosen as positive Crucially q(s ) q(s C ) and s 2 = TR algorithm converges to a first-order critical point

26 HOW GOOD IS TRUNCATED C.G.? In the convex case... very good Theorem 3.. Suppose that the truncated conjugate gradient method is applied to minimize q(s) and that B is positive definite. Then the computed and actual solutions to the problem, s and s M, satisfy the bound q(s ) 2q(s M ) In the non-convex case... maybe poor e.g., if g = 0 and B is indefinite = q(s ) = 0 WHAT CAN WE DO IN THE NON-CONVEX CASE? Solve the problem over a subspace instead of the B-conjugate subspace for CG, use the equivalent Lanczos orthogonal basis Gram-Schmidt applied to CG (Krylov) basis D i Subspace Q i = {s s = Q i s q for some s q IR i } Q i is such that Q i T Q i = I and Q i T BQ i = T i where T i is tridiagonal and Q i T g = g 2 e Q i trivial to generate from CG D i

27 GENERALIZED LANCZOS TRUST-REGION METHOD = s i = Q i s i q, where s i = arg min s Q i q(s) subject to s 2 s i q = arg min s q IR i g 2 e T s q + 2s T q T i s q subject to s q 2 advantage T i has very sparse factors = can solve the problem using the earlier secular equation approach can exploit all the structure here = use solution for one problem to initialize next until the trust-region boundary is reached, it is conjugate gradients = switch when we get there

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Coralia Cartis, University of Oxford INFOMM CDT: Modelling, Analysis and Computation of Continuous Real-World Problems Methods