Inexact Newton Methods and Nonlinear Constrained Optimization
Frank E. Curtis
EPSRC Symposium Capstone Conference
Warwick Mathematics Institute
July 2, 2009
Outline
- PDE-Constrained Optimization
- Newton's method
- Inexactness
- Experimental results
- Conclusion and final remarks
Hyperthermia treatment
Regional hyperthermia is a cancer therapy that aims at heating large and deeply seated tumors by means of radio wave absorption. This kills tumor cells and makes them more susceptible to accompanying therapies, e.g., chemotherapy.
Hyperthermia treatment planning
Computer modeling can be used to help plan the therapy for each patient, and it opens the door for numerical optimization. The goal is to heat the tumor to a target temperature of 43°C while minimizing damage to nearby cells.
PDE-constrained optimization

  \min f(x) \ \ \text{s.t.} \ \ c_E(x) = 0, \ c_I(x) \geq 0

- The problem is infinite-dimensional
- Controls and states: x = (u, y)
- Solution methods integrate numerical simulation, problem structure, and optimization algorithms
Algorithmic frameworks
We hear the phrases "discretize-then-optimize" and "optimize-then-discretize." I prefer:
- Discretize the optimization problem:
  \min f(x) \ \text{s.t.} \ c(x) = 0 \quad \longrightarrow \quad \min f^h(x) \ \text{s.t.} \ c^h(x) = 0
- Discretize the optimality conditions:
  \begin{bmatrix} \nabla f + \nabla c\,\lambda \\ c \end{bmatrix} = 0 \quad \longrightarrow \quad \begin{bmatrix} (\nabla f + \nabla c\,\lambda)^h \\ c^h \end{bmatrix} = 0
- Discretize the search direction computation
Algorithms
Nonlinear elimination:
  \min_{u,y} f(u,y) \ \text{s.t.} \ c(u,y) = 0 \quad \longrightarrow \quad \min_u f(u, y(u)), \quad \nabla_u f + \nabla_u y^T \nabla_y f = 0
Reduced-space methods:
- d_y: toward satisfying the constraints
- \lambda: Lagrange multiplier estimates
- d_u: toward optimality
Full-space methods:
  \begin{bmatrix} H_u & 0 & A_u^T \\ 0 & H_y & A_y^T \\ A_u & A_y & 0 \end{bmatrix} \begin{bmatrix} d_u \\ d_y \\ \delta \end{bmatrix} = - \begin{bmatrix} \nabla_u f + A_u^T \lambda \\ \nabla_y f + A_y^T \lambda \\ c \end{bmatrix}
Outline
- PDE-Constrained Optimization
- Newton's method
- Inexactness
- Experimental results
- Conclusion and final remarks
Nonlinear equations
Newton's method for F(x) = 0:
  F'(x_k) d_k = -F(x_k)
Judge progress by the merit function \phi(x) \triangleq \tfrac{1}{2}\|F(x)\|^2. The direction is one of descent since
  \nabla\phi(x_k)^T d_k = F(x_k)^T F'(x_k) d_k = -\|F(x_k)\|^2 < 0
(Note the consistency between the step computation and the merit function!)
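For concreteness, a minimal numpy sketch of this iteration; the example system F and its Jacobian J are illustrative, and a small dense solve stands in for whatever linear algebra the application requires:

```python
import numpy as np

def newton(F, J, x, tol=1e-10, max_iter=50):
    """Newton's method for F(x) = 0, with the merit function
    phi(x) = 0.5*||F(x)||^2 monitored for descent."""
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            break
        d = np.linalg.solve(J(x), -Fx)   # exact Newton step: J(x) d = -F(x)
        # grad phi = J^T F, so grad(phi)^T d = F^T (J d) = -||F||^2 < 0
        assert Fx @ (J(x) @ d) < 0
        x = x + d
    return x

# Illustrative system: F(x) = (x1^2 + x2 - 2, x1 - x2), root at (1, 1)
F = lambda x: np.array([x[0]**2 + x[1] - 2.0, x[0] - x[1]])
J = lambda x: np.array([[2.0*x[0], 1.0], [1.0, -1.0]])
print(newton(F, J, np.array([2.0, 0.0])))
```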
Equality constrained optimization
Consider
  \min_{x \in \mathbb{R}^n} f(x) \ \ \text{s.t.} \ \ c(x) = 0
The Lagrangian is
  \mathcal{L}(x,\lambda) \triangleq f(x) + \lambda^T c(x)
so the first-order optimality conditions are
  \nabla \mathcal{L}(x,\lambda) = F(x,\lambda) = \begin{bmatrix} \nabla f(x) + \nabla c(x)\lambda \\ c(x) \end{bmatrix} = 0
Merit function
Simply minimizing
  \tfrac{1}{2}\|F(x,\lambda)\|^2 = \tfrac{1}{2} \left\| \begin{bmatrix} \nabla f(x) + \nabla c(x)\lambda \\ c(x) \end{bmatrix} \right\|^2
is generally inappropriate for constrained optimization. We use the merit function
  \phi(x;\pi) \triangleq f(x) + \pi\|c(x)\|
where \pi is a penalty parameter.
Minimizing a penalty function
Consider the penalty function for
  \min (x-1)^2 \ \ \text{s.t.} \ \ x = 0,
i.e.,
  \phi(x;\pi) = (x-1)^2 + \pi|x|,
for different values of the penalty parameter \pi. For \pi = 1 the minimizer of \phi is x = 1/2, which is infeasible; for \pi = 2 (i.e., \pi \geq |\lambda^*| = 2) the minimizer is the solution x = 0.
Figure: \pi = 1. Figure: \pi = 2.
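A quick numerical check of the exactness behavior sketched above (the grid-based minimization is purely illustrative):

```python
import numpy as np

x = np.linspace(-0.5, 1.5, 2001)
for pi_ in (1.0, 2.0):
    phi = (x - 1.0)**2 + pi_ * np.abs(x)   # phi(x; pi) for min (x-1)^2 s.t. x = 0
    print(f"pi = {pi_}: minimizer of phi is x = {x[np.argmin(phi)]:.3f}")
# pi = 1.0: minimizer is x = 0.500 (infeasible)
# pi = 2.0: minimizer is x = 0.000 (the penalty is exact once pi >= |lambda*| = 2)
```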
Algorithm 0: Newton method for optimization
(Assume the problem is sufficiently convex and regular.)
for k = 0, 1, 2, ...
- Solve the primal-dual (Newton) equations
  \begin{bmatrix} H(x_k,\lambda_k) & \nabla c(x_k) \\ \nabla c(x_k)^T & 0 \end{bmatrix} \begin{bmatrix} d_k \\ \delta_k \end{bmatrix} = - \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ c(x_k) \end{bmatrix}
- Increase \pi, if necessary, so that D\phi_k(d_k;\pi_k) < 0 (e.g., \pi_k \geq \|\lambda_k + \delta_k\|)
- Backtrack from \alpha_k \leftarrow 1 to satisfy the Armijo condition
  \phi(x_k + \alpha_k d_k;\pi_k) \leq \phi(x_k;\pi_k) + \eta\alpha_k D\phi_k(d_k;\pi_k)
- Update the iterate: (x_{k+1},\lambda_{k+1}) \leftarrow (x_k,\lambda_k) + \alpha_k(d_k,\delta_k)
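A dense-algebra sketch of this loop, assuming the \ell_2 penalty \phi(x;\pi) = f(x) + \pi\|c(x)\| and a simple additive safeguard in the \pi update; the function names, tolerances, and the rule \pi_k \geq \|\lambda_k + \delta_k\| + \epsilon are illustrative choices, not the author's implementation:

```python
import numpy as np

def algorithm0(f, g, c, Jc, H, x, lam, eta=1e-4, tol=1e-8, max_iter=100):
    """Newton method for min f(x) s.t. c(x) = 0 (convex, regular case).
    g = grad f, Jc = constraint Jacobian (m x n), H = Lagrangian Hessian."""
    n, m, pi = x.size, lam.size, 1.0
    phi = lambda z, p: f(z) + p * np.linalg.norm(c(z))
    for _ in range(max_iter):
        gk, ck, Ak = g(x), c(x), Jc(x)
        if np.linalg.norm(np.concatenate([gk + Ak.T @ lam, ck])) <= tol:
            break
        # Primal-dual Newton system
        K = np.block([[H(x, lam), Ak.T], [Ak, np.zeros((m, m))]])
        sol = np.linalg.solve(K, -np.concatenate([gk + Ak.T @ lam, ck]))
        d, delta = sol[:n], sol[n:]
        # Since Ak d = -ck, the directional derivative of the merit function is
        # D phi(d; pi) = g^T d - pi*||c||, negative once pi >= ||lam + delta||
        pi = max(pi, np.linalg.norm(lam + delta) + 1e-4)
        Dphi = gk @ d - pi * np.linalg.norm(ck)
        # Armijo backtracking line search on the merit function
        alpha = 1.0
        while phi(x + alpha * d, pi) > phi(x, pi) + eta * alpha * Dphi:
            alpha *= 0.5
        x, lam = x + alpha * d, lam + alpha * delta
    return x, lam
```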
Convergence of Algorithm 0
Assumption: the sequence {(x_k,\lambda_k)} is contained in a convex set \Omega over which f, c, and their first derivatives are bounded and Lipschitz continuous. Also,
- (Regularity) \nabla c(x_k)^T has full row rank with singular values bounded below by a positive constant
- (Convexity) u^T H(x_k,\lambda_k) u \geq \mu\|u\|^2 for some \mu > 0 for all u \in \mathbb{R}^n satisfying u \neq 0 and \nabla c(x_k)^T u = 0
Theorem (Han (1977)): the sequence {(x_k,\lambda_k)} yields the limit
  \lim_{k\to\infty} \left\| \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ c(x_k) \end{bmatrix} \right\| = 0
Outline
- PDE-Constrained Optimization
- Newton's method
- Inexactness
- Experimental results
- Conclusion and final remarks
Large-scale primal-dual algorithms
Computational issues:
- Large matrices to be stored
- Large matrices to be factored
Algorithmic issues:
- The problem may be nonconvex
- The problem may be ill-conditioned
Computational/algorithmic issues:
- Without matrix factorizations, these difficulties become more difficult to detect and handle
Nonlinear equations
Compute
  F'(x_k) d_k = -F(x_k) + r_k
requiring (Dembo, Eisenstat, Steihaug (1982))
  \|r_k\| \leq \kappa\|F(x_k)\|, \quad \kappa \in (0,1)
Progress is judged by the merit function \phi(x) \triangleq \tfrac{1}{2}\|F(x)\|^2. Again, note the consistency:
  \nabla\phi(x_k)^T d_k = F(x_k)^T F'(x_k) d_k = -\|F(x_k)\|^2 + F(x_k)^T r_k \leq (\kappa - 1)\|F(x_k)\|^2 < 0
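A matrix-free sketch with SciPy's GMRES, whose relative tolerance is measured against \|F(x_k)\| and therefore enforces exactly \|r_k\| \leq \kappa\|F(x_k)\|. This assumes SciPy >= 1.12, where the keyword is rtol (older versions call it tol):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def inexact_newton(F, Jv, x, kappa=0.1, tol=1e-10, max_iter=100):
    """Inexact Newton: solve J(x_k) d = -F(x_k) only to relative accuracy kappa.
    Jv(x, v) applies the Jacobian to a vector, so no matrix is formed or factored."""
    n = x.size
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            break
        J = LinearOperator((n, n), matvec=lambda v, x=x: Jv(x, v))
        d, info = gmres(J, -Fx, rtol=kappa)  # stops once ||J d + F|| <= kappa*||F||
        x = x + d
    return x
```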
Optimization
Compute
  \begin{bmatrix} H(x_k,\lambda_k) & \nabla c(x_k) \\ \nabla c(x_k)^T & 0 \end{bmatrix} \begin{bmatrix} d_k \\ \delta_k \end{bmatrix} = - \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ c(x_k) \end{bmatrix} + \begin{bmatrix} \rho_k \\ r_k \end{bmatrix}
satisfying
  \left\| \begin{bmatrix} \rho_k \\ r_k \end{bmatrix} \right\| \leq \kappa \left\| \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ c(x_k) \end{bmatrix} \right\|, \quad \kappa \in (0,1)
If \kappa is not sufficiently small (e.g., 10^{-3} vs. 10^{-12}), then d_k may be an ascent direction for our merit function; i.e., D\phi_k(d_k;\pi_k) > 0 for all \pi_k \geq \pi_{k-1}.
Our work begins here... inexact Newton methods for optimization. We cover the convex case, nonconvexity, irregularity, and inequality constraints.
Model reductions
Define the model of \phi(x;\pi):
  m(d;\pi) \triangleq f(x) + \nabla f(x)^T d + \pi\|c(x) + \nabla c(x)^T d\|
d_k is acceptable if
  \Delta m(d_k;\pi_k) \triangleq m(0;\pi_k) - m(d_k;\pi_k) = -\nabla f(x_k)^T d_k + \pi_k(\|c(x_k)\| - \|c(x_k) + \nabla c(x_k)^T d_k\|) \geq 0
This ensures D\phi_k(d_k;\pi_k) \leq 0 (and more).
Termination test 1
The search direction (d_k,\delta_k) is acceptable if
  \left\| \begin{bmatrix} \rho_k \\ r_k \end{bmatrix} \right\| \leq \kappa \left\| \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ c(x_k) \end{bmatrix} \right\|, \quad \kappa \in (0,1)
and if for \pi_k = \pi_{k-1} and some \sigma \in (0,1) we have
  \Delta m(d_k;\pi_k) \geq \underbrace{\max\{\tfrac{1}{2} d_k^T H(x_k,\lambda_k) d_k, 0\}}_{\geq 0 \text{ for any } d} + \sigma\pi_k \max\{\|c(x_k)\|, \|r_k\| - \|c(x_k)\|\}
Termination test 2
The search direction (d_k,\delta_k) is acceptable if
  \|\rho_k\| \leq \beta\|c(x_k)\|, \ \beta > 0 \quad \text{and} \quad \|r_k\| \leq \epsilon\|c(x_k)\|, \ \epsilon \in (0,1)
Increasing the penalty parameter \pi then yields
  \Delta m(d_k;\pi_k) \geq \underbrace{\max\{\tfrac{1}{2} d_k^T H(x_k,\lambda_k) d_k, 0\}}_{\geq 0 \text{ for any } d} + \sigma\pi_k\|c(x_k)\|
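In schematic form, the two tests applied to a trial inexact step. Here gf is \nabla f(x_k), gL is \nabla f(x_k) + \nabla c(x_k)\lambda_k, A is the constraint Jacobian, and the parameter values \kappa, \sigma, \beta, \epsilon are illustrative:

```python
import numpy as np

def model_reduction(gf, c, A, d, pi):
    # Delta m(d; pi) = -gf^T d + pi*(||c|| - ||c + A d||)
    return -gf @ d + pi * (np.linalg.norm(c) - np.linalg.norm(c + A @ d))

def test1(gf, gL, c, A, H, d, rho, r, pi, kappa=0.1, sigma=0.1):
    """Termination test 1: small residual relative to the KKT right-hand side,
    plus sufficient reduction in the merit model for the current pi."""
    small = (np.linalg.norm(np.concatenate([rho, r]))
             <= kappa * np.linalg.norm(np.concatenate([gL, c])))
    enough = (model_reduction(gf, c, A, d, pi)
              >= max(0.5 * d @ (H @ d), 0.0)
              + sigma * pi * max(np.linalg.norm(c),
                                 np.linalg.norm(r) - np.linalg.norm(c)))
    return small and enough

def test2(rho, r, c, beta=10.0, eps=0.5):
    """Termination test 2: both residual blocks small relative to the constraint
    violation; accepting under this test triggers a penalty parameter increase."""
    nc = np.linalg.norm(c)
    return np.linalg.norm(rho) <= beta * nc and np.linalg.norm(r) <= eps * nc
```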
Algorithm 1: Inexact Newton for optimization
(Byrd, Curtis, Nocedal (2008))
for k = 0, 1, 2, ...
- Iteratively solve
  \begin{bmatrix} H(x_k,\lambda_k) & \nabla c(x_k) \\ \nabla c(x_k)^T & 0 \end{bmatrix} \begin{bmatrix} d_k \\ \delta_k \end{bmatrix} = - \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ c(x_k) \end{bmatrix}
  until termination test 1 or 2 is satisfied
- If only termination test 2 is satisfied, increase \pi so that
  \pi_k \geq \max\left\{ \pi_{k-1}, \frac{\nabla f(x_k)^T d_k + \max\{\tfrac{1}{2} d_k^T H(x_k,\lambda_k) d_k, 0\}}{(1-\tau)(\|c(x_k)\| - \|r_k\|)} \right\}
- Backtrack from \alpha_k \leftarrow 1 to satisfy
  \phi(x_k + \alpha_k d_k;\pi_k) \leq \phi(x_k;\pi_k) - \eta\alpha_k \Delta m(d_k;\pi_k)
- Update the iterate: (x_{k+1},\lambda_{k+1}) \leftarrow (x_k,\lambda_k) + \alpha_k(d_k,\delta_k)
Convergence of Algorithm 1
Assumption: the sequence {(x_k,\lambda_k)} is contained in a convex set \Omega over which f, c, and their first derivatives are bounded and Lipschitz continuous. Also,
- (Regularity) \nabla c(x_k)^T has full row rank with singular values bounded below by a positive constant
- (Convexity) u^T H(x_k,\lambda_k) u \geq \mu\|u\|^2 for some \mu > 0 for all u \in \mathbb{R}^n satisfying u \neq 0 and \nabla c(x_k)^T u = 0
Theorem (Byrd, Curtis, Nocedal (2008)): the sequence {(x_k,\lambda_k)} yields the limit
  \lim_{k\to\infty} \left\| \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ c(x_k) \end{bmatrix} \right\| = 0
Handling nonconvexity and rank deficiency
There are two assumptions we aim to drop:
- (Regularity) \nabla c(x_k)^T has full row rank with singular values bounded below by a positive constant
- (Convexity) u^T H(x_k,\lambda_k) u \geq \mu\|u\|^2 for some \mu > 0 for all u \in \mathbb{R}^n satisfying u \neq 0 and \nabla c(x_k)^T u = 0
For example, the problem is not regular if it is infeasible, and it is not convex if there are maximizers and/or saddle points. Without these assumptions, Algorithm 1 may stall or may not be well-defined.
No factorizations means no clue
We might not store or factor
  \begin{bmatrix} H(x_k,\lambda_k) & \nabla c(x_k) \\ \nabla c(x_k)^T & 0 \end{bmatrix}
so we might not know if the problem is nonconvex or ill-conditioned. Common practice is to perturb the matrix to be
  \begin{bmatrix} H(x_k,\lambda_k) + \xi_1 I & \nabla c(x_k) \\ \nabla c(x_k)^T & -\xi_2 I \end{bmatrix}
where \xi_1 convexifies the model and \xi_2 regularizes the constraints. Poor choices of \xi_1 and \xi_2 can have terrible consequences in the algorithm.
Our approach for global convergence
Decompose the direction d_k = v_k + u_k into a normal component v_k (toward satisfying the constraints) and a tangential component u_k (toward optimality). We impose a specific type of trust region constraint on the normal step v_k in case the constraint Jacobian is (nearly) rank deficient.
Handling nonconvexity
In the computation of d_k = v_k + u_k, convexify the Hessian as in
  \begin{bmatrix} H(x_k,\lambda_k) + \xi_1 I & \nabla c(x_k) \\ \nabla c(x_k)^T & 0 \end{bmatrix}
by monitoring the iterates. Hessian modification strategy: increase \xi_1 whenever
  \|u_k\|^2 > \psi\|v_k\|^2, \ \psi > 0 \quad \text{and} \quad \tfrac{1}{2} u_k^T (H(x_k,\lambda_k) + \xi_1 I) u_k < \theta\|u_k\|^2, \ \theta > 0
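The modification trigger is cheap to check during the inner iteration. A sketch, where \psi, \theta, and the geometric growth of \xi_1 are illustrative choices:

```python
import numpy as np

def needs_convexification(H, xi1, u, v, psi=0.1, theta=1e-8):
    """Increase xi1 when the tangential step u dominates the normal step v
    but the modified Hessian shows too little curvature along u."""
    tangential_dominates = u @ u > psi * (v @ v)
    low_curvature = 0.5 * u @ (H @ u) + 0.5 * xi1 * (u @ u) < theta * (u @ u)
    return tangential_dominates and low_curvature

# Typical use inside the inner solve:
#   while needs_convexification(H, xi1, u_k, v_k):
#       xi1 = 2.0 * xi1 if xi1 > 0 else 1e-4   # grow the perturbation
#       ... continue iterating on the system with H + xi1*I ...
```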
Algorithm 2: Inexact Newton (regularized)
(Curtis, Nocedal, Wächter (2009))
for k = 0, 1, 2, ...
- Approximately solve
  \min_v \tfrac{1}{2}\|c(x_k) + \nabla c(x_k)^T v\|^2 \ \ \text{s.t.} \ \ \|v\| \leq \omega\|\nabla c(x_k) c(x_k)\|
  to compute v_k satisfying a Cauchy decrease condition
- Iteratively solve
  \begin{bmatrix} H(x_k,\lambda_k) + \xi_1 I & \nabla c(x_k) \\ \nabla c(x_k)^T & 0 \end{bmatrix} \begin{bmatrix} d_k \\ \delta_k \end{bmatrix} = - \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ -\nabla c(x_k)^T v_k \end{bmatrix}
  until termination test 1 or 2 is satisfied, increasing \xi_1 as described
- If only termination test 2 is satisfied, increase \pi so that
  \pi_k \geq \max\left\{ \pi_{k-1}, \frac{\nabla f(x_k)^T d_k + \max\{\tfrac{1}{2} u_k^T (H(x_k,\lambda_k) + \xi_1 I) u_k, \theta\|u_k\|^2\}}{(1-\tau)(\|c(x_k)\| - \|c(x_k) + \nabla c(x_k)^T d_k\|)} \right\}
- Backtrack from \alpha_k \leftarrow 1 to satisfy
  \phi(x_k + \alpha_k d_k;\pi_k) \leq \phi(x_k;\pi_k) - \eta\alpha_k \Delta m(d_k;\pi_k)
- Update the iterate: (x_{k+1},\lambda_{k+1}) \leftarrow (x_k,\lambda_k) + \alpha_k(d_k,\delta_k)
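The normal subproblem only needs Cauchy decrease, so even a single steepest-descent step on \tfrac{1}{2}\|c + Jv\|^2, clipped to the trust region, suffices. A sketch with J the constraint Jacobian, so that the radius \omega\|J^T c\| matches \omega\|\nabla c(x_k) c(x_k)\| above:

```python
import numpy as np

def normal_cauchy_step(J, c, omega=1.0):
    """One Cauchy step for min_v 0.5*||c + J v||^2 s.t. ||v|| <= omega*||J^T c||.
    Returns v = -t * J^T c with the step length clipped to the trust region."""
    g = J.T @ c                      # gradient of the model at v = 0
    if not np.any(g):
        return np.zeros(J.shape[1])  # already stationary for feasibility
    Jg = J @ g
    t = (g @ g) / (Jg @ Jg)          # unconstrained minimizer along -g
    return -min(t, omega) * g        # radius/||g|| = omega, so clip t at omega
```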
Convergence of Algorithm 2
Assumption: the sequence {(x_k,\lambda_k)} is contained in a convex set \Omega over which f, c, and their first derivatives are bounded and Lipschitz continuous.
Theorem (Curtis, Nocedal, Wächter (2009)): if all limit points of {\nabla c(x_k)^T} have full row rank, then the sequence {(x_k,\lambda_k)} yields the limit
  \lim_{k\to\infty} \left\| \begin{bmatrix} \nabla f(x_k) + \nabla c(x_k)\lambda_k \\ c(x_k) \end{bmatrix} \right\| = 0.
Otherwise, and if {\pi_k} is bounded, then
  \lim_{k\to\infty} \|\nabla c(x_k) c(x_k)\| = 0 \quad \text{and} \quad \lim_{k\to\infty} \|\nabla f(x_k) + \nabla c(x_k)\lambda_k\| = 0
Handling inequalities
Interior-point methods are attractive for large applications. Line-search interior-point methods that enforce
  c(x_k) + \nabla c(x_k)^T d_k = 0
may fail to converge globally (Wächter, Biegler (2000)). Fortunately, the trust region subproblem we use to regularize the constraints also saves us from this type of failure!
Algorithm 2 (interior-point version)
Apply Algorithm 2 to the logarithmic-barrier subproblem for \mu \searrow 0:
  \min f(x) - \mu \sum_{i=1}^{q} \ln s_i \ \ \text{s.t.} \ \ c_E(x) = 0, \ c_I(x) - s = 0
Define the (slack-scaled) primal-dual system
  \begin{bmatrix} H(x_k,\lambda_{E,k},\lambda_{I,k}) & 0 & \nabla c_E(x_k) & \nabla c_I(x_k) \\ 0 & \mu I & 0 & -S_k \\ \nabla c_E(x_k)^T & 0 & 0 & 0 \\ \nabla c_I(x_k)^T & -S_k & 0 & 0 \end{bmatrix} \begin{bmatrix} d_k^x \\ d_k^s \\ \delta_{E,k} \\ \delta_{I,k} \end{bmatrix} = - \begin{bmatrix} \nabla f(x_k) + \nabla c_E(x_k)\lambda_{E,k} + \nabla c_I(x_k)\lambda_{I,k} \\ -\mu e - S_k\lambda_{I,k} \\ c_E(x_k) \\ c_I(x_k) - s_k \end{bmatrix}
so that the iterate update has
  \begin{bmatrix} x_{k+1} \\ s_{k+1} \end{bmatrix} \leftarrow \begin{bmatrix} x_k \\ s_k \end{bmatrix} + \alpha_k \begin{bmatrix} d_k^x \\ S_k d_k^s \end{bmatrix}
Incorporate a fraction-to-the-boundary rule in the line search and a slack reset in the algorithm to maintain s \geq \max\{0, c_I(x)\}.
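The fraction-to-the-boundary rule and the slack reset are both one-liners. A sketch, where \tau = 0.995 is a typical but illustrative value and ds denotes the unscaled slack step \Delta s = S_k d_k^s:

```python
import numpy as np

def fraction_to_boundary(s, ds, tau=0.995):
    """Largest alpha in (0, 1] with s + alpha*ds >= (1 - tau)*s,
    keeping the slacks (and hence the barrier terms) strictly positive."""
    neg = ds < 0
    if not np.any(neg):
        return 1.0
    return min(1.0, np.min(-tau * s[neg] / ds[neg]))

def slack_reset(s, cI_x):
    """After the step, maintain s >= max{0, c_I(x)} as on the slide."""
    return np.maximum(s, np.maximum(cI_x, 0.0))
```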
Convergence of Algorithm 2 (interior-point)
Assumption: the sequence {(x_k,\lambda_{E,k},\lambda_{I,k})} is contained in a convex set \Omega over which f, c_E, c_I, and their first derivatives are bounded and Lipschitz continuous.
Theorem (Curtis, Schenk, Wächter (2009)):
- For a given \mu, Algorithm 2 yields the same limits as in the equality constrained case
- If Algorithm 2 yields a sufficiently accurate solution to the barrier subproblem for each \mu_j \searrow 0, and if the linear independence constraint qualification (LICQ) holds at a limit point x^* of {x_j}, then there exist Lagrange multipliers \lambda^* such that the first-order optimality conditions of the nonlinear program are satisfied
Outline
- PDE-Constrained Optimization
- Newton's method
- Inexactness
- Experimental results
- Conclusion and final remarks
Implementation details
- Incorporated in the IPOPT software package (Wächter) as the inexact algorithm option
- Linear systems solved with PARDISO (Schenk), using SQMR (Freund (1994))
- Preconditioning in PARDISO: incomplete multilevel factorization with inverse-based pivoting, stabilized by symmetric-weighted matchings
- Optimality tolerance: 1e-8
CUTEr and COPS collections
- 745 problems written in AMPL
- 645 solved successfully
- 42 real failures
- Robustness between 87% and 94%
- Original IPOPT: 93%
Helmholtz

  N   | n      | p      | q     | # iter | CPU sec (per iter)
  32  | 14724  | 13824  | 1800  | 37     | 807.823 (21.833)
  64  | 56860  | 53016  | 7688  | 25     | 3741.42 (149.66)
  128 | 227940 | 212064 | 31752 | 20     | 54581.8 (2729.1)
Helmholtz
Not taking nonconvexity into account: [figure]
Boundary control

  \min \tfrac{1}{2}\int_\Omega (y(x) - y_t(x))^2\,dx
  \text{s.t.} \ -\nabla\cdot(e^{y(x)}\nabla y(x)) = 20 \ \text{in } \Omega
  y(x) = u(x) \ \text{on } \partial\Omega
  2.5 \leq u(x) \leq 3.5 \ \text{on } \partial\Omega

where y_t(x) = 3 + 10 x_1 (x_1 - 1) x_2 (x_2 - 1) \sin(2\pi x_3)

  N  | n      | p      | q     | # iter | CPU sec (per iter)
  16 | 4096   | 2744   | 2704  | 13     | 2.8144 (0.2165)
  32 | 32768  | 27000  | 11536 | 13     | 103.65 (7.9731)
  64 | 262144 | 238328 | 47632 | 14     | 5332.3 (380.88)

Original IPOPT with N = 32 requires 238 seconds per iteration.
Hyperthermia treatment planning

  \min \tfrac{1}{2}\int_\Omega (y(x) - y_t(x))^2\,dx
  \text{s.t.} \ -\Delta y(x) - 10(y(x) - 37) = u^* M(x) u \ \text{in } \Omega
  37.0 \leq y(x) \leq 37.5 \ \text{on } \partial\Omega
  42.0 \leq y(x) \leq 44.0 \ \text{in } \Omega_0

where u_j = a_j e^{i\phi_j}, M_{jk}(x) = \langle E_j(x), E_k(x) \rangle, E_j = \sin(j x_1 x_2 x_3 \pi)

  N  | n     | p     | q     | # iter | CPU sec (per iter)
  16 | 4116  | 2744  | 2994  | 68     | 22.893 (0.3367)
  32 | 32788 | 27000 | 13034 | 51     | 3055.9 (59.920)

Original IPOPT with N = 32 requires 408 seconds per iteration.
Groundwater modeling

  \min \tfrac{1}{2}\int_\Omega (y(x) - y_t(x))^2\,dx + \tfrac{\alpha}{2}\int_\Omega \left[\beta(u(x) - u_t(x))^2 + \|\nabla(u(x) - u_t(x))\|^2\right] dx
  \text{s.t.} \ -\nabla\cdot(e^{u(x)}\nabla y_i(x)) = q_i(x) \ \text{in } \Omega, \ i = 1,\dots,6
  \frac{\partial y_i(x)}{\partial n} = 0 \ \text{on } \partial\Omega
  \int_\Omega y_i(x)\,dx = 0, \ i = 1,\dots,6
  1 \leq u(x) \leq 2 \ \text{in } \Omega

where q_i = 100 \sin(2\pi x_1)\sin(2\pi x_2)\sin(2\pi x_3)

  N  | n       | p       | q      | # iter | CPU sec (per iter)
  16 | 28672   | 24576   | 8192   | 18     | 206.416 (11.4676)
  32 | 229376  | 196608  | 65536  | 20     | 1963.64 (98.1820)
  64 | 1835008 | 1572864 | 524288 | 21     | 134418. (6400.85)

Original IPOPT with N = 32 requires approx. 20 hours for the first iteration.
Outline
- PDE-Constrained Optimization
- Newton's method
- Inexactness
- Experimental results
- Conclusion and final remarks
Conclusion and final remarks
- PDE-constrained optimization is an active and exciting area
- We presented an inexact Newton method with a theoretical foundation
- Convergence guarantees are as good as those of exact methods, and sometimes better
- Numerical experiments are promising so far, with more to come