Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization
Nick Gould (RAL)

minimize_{x ∈ IR^n} f(x) subject to c(x) = 0

Part C course on continuous optimization

CONSTRAINED MINIMIZATION

minimize_{x ∈ IR^n} f(x) subject to c(x) = 0

where the objective function f : IR^n → IR and the constraints c : IR^n → IR^m
- assume that f, c ∈ C^1 (sometimes C^2) and Lipschitz continuous
- often violated in practice, but this assumption is not essential
CONSTRAINTS AND MERIT FUNCTIONS

Two conflicting goals:
- minimize the objective function f(x)
- satisfy the constraints

Overcome this by minimizing a composite merit function Φ(x, p), for which p are parameters
- (some) minimizers of Φ(x, p) with respect to x approach those of f(x) subject to the constraints as p approaches some set P
- only uses unconstrained minimization methods

AN EXAMPLE FOR EQUALITY CONSTRAINTS

minimize_{x ∈ IR^n} f(x) subject to c(x) = 0

Merit function (quadratic penalty function):

  Φ(x, µ) = f(x) + ||c(x)||_2^2 / (2µ)

- required solution as µ approaches 0 from above
- may have other useless stationary points
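For the running example used in the contour plots that follow (minimize x_1^2 + x_2^2 subject to x_1 + x_2^2 = 1), the penalty minimizer can be written down explicitly, which makes both properties above concrete. The closed form below is a hand derivation for this example (it is not stated on the slides); the code simply verifies it:

```python
import numpy as np

# Slides' running example: minimize x1^2 + x2^2 s.t. x1 + x2^2 = 1.
# For mu < 1/2 the quadratic penalty Phi(x, mu) = f(x) + c(x)^2/(2 mu)
# has the explicit minimizer (hand derivation, an assumption to verify):
#   x(mu) = (1/2, sqrt(1/2 - mu)),  with  c(x(mu)) = -mu.
def grad_phi(x, mu):
    ci = x[0] + x[1]**2 - 1.0                  # constraint value c(x)
    return np.array([2*x[0] + ci/mu,           # d/dx1: 2 x1 + c/mu
                     2*x[1] + 2*x[1]*ci/mu])   # d/dx2: 2 x2 + 2 x2 c/mu

for mu in [0.4, 0.1, 0.01, 1e-4]:
    x = np.array([0.5, np.sqrt(0.5 - mu)])
    assert np.allclose(grad_phi(x, mu), 0.0, atol=1e-8)  # stationary point
    print(mu, x, x[0] + x[1]**2 - 1.0)   # constraint violation equals -mu

# As mu -> 0+, x(mu) -> (0.5, 1/sqrt 2), the constrained minimizer, while
# the constraint is violated by exactly -mu at every positive mu.
```

Note the second property too: the point (1/(1+2µ), 0) is another stationary point of Φ for this example, and it is a useless saddle once µ < 1/2.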
CONTOURS OF THE PENALTY FUNCTION

[two contour plots] Quadratic penalty function for min x_1^2 + x_2^2 subject to x_1 + x_2^2 = 1, with µ = 100 (left) and µ = 1 (right)

CONTOURS OF THE PENALTY FUNCTION (cont.)

[two contour plots] Quadratic penalty function for min x_1^2 + x_2^2 subject to x_1 + x_2^2 = 1, with µ = 0.1 (left) and µ = 0.01 (right)
BASIC QUADRATIC PENALTY FUNCTION ALGORITHM

Given µ_0 > 0, set k = 0
Until "convergence" iterate:
  Starting from x_k^S, use an unconstrained minimization algorithm to find an approximate minimizer x_k of Φ(x, µ_k)
  Compute µ_{k+1} > 0 smaller than µ_k such that lim_{k→∞} µ_{k+1} = 0, and increase k by 1
- often choose µ_{k+1} = 0.1 µ_k or even µ_{k+1} = µ_k^2
- might choose x_{k+1}^S = x_k

MAIN CONVERGENCE RESULT

Theorem 5.1. Suppose that f, c ∈ C^2, that

  y_k := -c(x_k)/µ_k,

that

  ||∇_x Φ(x_k, µ_k)||_2 ≤ ε_k,

where ε_k converges to zero as k → ∞, and that x_k converges to x_* for which A(x_*) is full rank. Then x_* satisfies the first-order necessary optimality conditions for the problem

  minimize_{x ∈ IR^n} f(x) subject to c(x) = 0

and {y_k} converge to the associated Lagrange multipliers y_*.
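A minimal sketch of this algorithm on the slides' running example (solution (1/2, ±1/√2), multiplier y* = 1). The inner solver here — steepest descent with Armijo backtracking — and the tolerance schedule are illustrative choices of mine, not from the slides; any unconstrained method would do:

```python
import numpy as np

# Quadratic penalty algorithm for min x1^2 + x2^2 s.t. x1 + x2^2 = 1.
def phi(x, mu):
    return x[0]**2 + x[1]**2 + (x[0] + x[1]**2 - 1.0)**2 / (2*mu)

def grad_phi(x, mu):
    ci = x[0] + x[1]**2 - 1.0
    return np.array([2*x[0] + ci/mu, 2*x[1] + 2*x[1]*ci/mu])

def inner(x, mu, tol):
    """Crude unconstrained minimization of phi(., mu), started from x."""
    for _ in range(100000):
        d = -grad_phi(x, mu)
        if np.linalg.norm(d) <= tol:
            break
        t = 1.0                       # Armijo backtracking linesearch
        while phi(x + t*d, mu) > phi(x, mu) - 1e-4*t*(d @ d) and t > 1e-12:
            t *= 0.5
        x = x + t*d
    return x

x, mu = np.array([0.0, 0.5]), 0.1
for k in range(2):
    x = inner(x, mu, tol=0.01*mu)            # approximate minimizer x_k
    y = -(x[0] + x[1]**2 - 1.0) / mu         # estimate y_k = -c(x_k)/mu_k
    print(k, mu, x, y)
    mu *= 0.1                                # mu_{k+1} = 0.1 mu_k
# x_k approaches (0.5, +-1/sqrt 2) and y_k approaches y* = 1 (Theorem 5.1)
```

Warm-starting each subproblem from the previous minimizer (x_{k+1}^S = x_k) matters in practice: the subproblems become progressively harder as µ_k shrinks.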
PROOF OF THEOREM 5.1

The generalized inverse A^+(x) := (A(x) A^T(x))^{-1} A(x) is bounded near x_*. Define

  y_k := -c(x_k)/µ_k  and  y_* := A^+(x_*) g(x_*).   (1)

Inner-iteration termination rule:

  ||g(x_k) - A^T(x_k) y_k||_2 ≤ ε_k   (2)

⇒ ||A^+(x_k) g(x_k) - y_k||_2 = ||A^+(x_k) (g(x_k) - A^T(x_k) y_k)||_2 ≤ 2 ||A^+(x_*)||_2 ε_k for all k sufficiently large

⇒ ||y_k - y_*||_2 ≤ ||A^+(x_*) g(x_*) - A^+(x_k) g(x_k)||_2 + ||A^+(x_k) g(x_k) - y_k||_2 → 0

⇒ {y_k} → y_*.

Continuity of gradients + (2) ⇒ g(x_*) - A^T(x_*) y_* = 0.
(1) implies c(x_k) = -µ_k y_k; continuity of constraints ⇒ c(x_*) = 0.
⇒ (x_*, y_*) satisfies the first-order optimality conditions.

ALGORITHMS TO MINIMIZE Φ(x, µ)

Can use:
- linesearch methods: might use a specialized linesearch to cope with the large quadratic term ||c(x)||_2^2 / (2µ)
- trust-region methods: (ideally) need to "shape" the trust region to cope with the contours of the ||c(x)||_2^2 / (2µ) term
DERIVATIVES OF THE QUADRATIC PENALTY FUNCTION

  ∇_x Φ(x, µ) = g(x, y(x))
  ∇_xx Φ(x, µ) = H(x, y(x)) + (1/µ) A^T(x) A(x)

where
- Lagrange multiplier estimates: y(x) := -c(x)/µ
- g(x, y(x)) = g(x) - A^T(x) y(x): gradient of the Lagrangian
- H(x, y(x)) = H(x) - Σ_{i=1}^m y_i(x) H_i(x): Lagrangian Hessian

GENERIC QUADRATIC PENALTY NEWTON SYSTEM

The Newton correction s from x for the quadratic penalty function satisfies

  (H(x, y(x)) + (1/µ) A^T(x) A(x)) s = -g(x, y(x))

LIMITING DERIVATIVES OF Φ

For small µ, roughly:

  ∇_x Φ(x, µ) = g(x) - A^T(x) y(x)   [moderate]
  ∇_xx Φ(x, µ) = H(x, y(x)) [moderate] + (1/µ) A^T(x) A(x) [large] ≈ (1/µ) A^T(x) A(x)
POTENTIAL DIFFICULTY

Ill-conditioning of the Hessian of the penalty function: roughly speaking (non-degenerate case)
- m eigenvalues ≈ λ_i[A^T(x) A(x)] / µ_k
- n - m eigenvalues ≈ λ_i[S^T(x) H(x, y_*) S(x)]

where S(x) is an orthogonal basis for the null-space of A(x)

⇒ condition number of ∇_xx Φ(x_k, µ_k) = O(1/µ_k)
⇒ may not be able to find the minimizer easily

THE ILL-CONDITIONING IS BENIGN

Newton system:

  (H(x, y(x)) + (1/µ) A^T(x) A(x)) s = -(g(x) + (1/µ) A^T(x) c(x))

Define the auxiliary variable w = (A(x) s + c(x)) / µ. Then

  ( H(x, y(x))   A^T(x) ) ( s )      ( g(x) )
  ( A(x)         -µI    ) ( w )  = - ( c(x) )

- essentially independent of µ for small µ ⇒ no inherent ill-conditioning
- thus can solve the Newton equations accurately
- more sophisticated analysis ⇒ original system OK
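Both claims are easy to check numerically. The sketch below compares the two formulations on the running example, at an illustrative point I placed near the penalty trajectory (the point, µ, and perturbation are my choices, not from the slides): the condensed matrix has condition number O(1/µ), the augmented matrix stays modest, yet both give the same Newton step s.

```python
import numpy as np

# Condensed penalty-Newton system vs. the equivalent augmented system
# for f = x1^2 + x2^2, c = x1 + x2^2 - 1 (the slides' example).
mu = 1e-6
x  = np.array([0.5 + 1e-5, np.sqrt(0.5 - mu) - 1e-5])  # near the trajectory

cx = x[0] + x[1]**2 - 1.0
g  = np.array([2*x[0], 2*x[1]])                    # gradient of f
A  = np.array([[1.0, 2*x[1]]])                     # Jacobian of c
y  = -cx/mu                                        # y(x) = -c(x)/mu
H  = 2*np.eye(2) - y*np.array([[0, 0], [0, 2.0]])  # Lagrangian Hessian

# condensed system: (H + A^T A/mu) s = -(g + A^T c/mu)
K1 = H + A.T @ A / mu
s1 = np.linalg.solve(K1, -(g + A.T @ np.array([cx]) / mu))

# augmented system: [H A^T; A -mu I](s, w) = -(g, c), with w = (A s + c)/mu
K2 = np.block([[H, A.T], [A, -mu*np.eye(1)]])
sw = np.linalg.solve(K2, -np.concatenate([g, [cx]]))

print(np.linalg.cond(K1))          # O(1/mu): huge
print(np.linalg.cond(K2))          # modest, essentially independent of mu
print(np.allclose(s1, sw[:2]))     # identical Newton step s
```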
PERTURBED OPTIMALITY CONDITIONS

First-order optimality conditions for minimize_{x ∈ IR^n} f(x) subject to c(x) = 0 are:

  g(x) - A^T(x) y = 0   (dual feasibility)
  c(x) = 0              (primal feasibility)

Consider the perturbed problem

  g(x) - A^T(x) y = 0   (dual feasibility)
  c(x) + µ y = 0        (perturbed primal feasibility)

where µ > 0

PRIMAL-DUAL PATH-FOLLOWING METHODS

Track the roots of g(x) - A^T(x) y = 0 and c(x) + µ y = 0 as 0 < µ → 0
- nonlinear system ⇒ use Newton's method

The Newton correction (s, v) to (x, y) satisfies

  ( H(x, y)   -A^T(x) ) ( s )      ( g(x) - A^T(x) y )
  ( A(x)       µI     ) ( v )  = - ( c(x) + µ y      )

Eliminate v:

  (H(x, y) + (1/µ) A^T(x) A(x)) s = -(g(x) + (1/µ) A^T(x) c(x))

c.f. Newton method for quadratic penalty function minimization!
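The elimination step can be verified directly: solving the full primal-dual system and the condensed system gives the same s, and the multiplier y drops out of the condensed right-hand side entirely. The point (x, y) and µ below are arbitrary illustrative choices on the running example:

```python
import numpy as np

# Primal-dual Newton step for g - A^T y = 0, c + mu y = 0 on the
# slides' example f = x1^2 + x2^2, c = x1 + x2^2 - 1.
mu = 1e-3
x, y = np.array([0.55, 0.68]), np.array([0.9])     # illustrative point

cx = np.array([x[0] + x[1]**2 - 1.0])
g  = np.array([2*x[0], 2*x[1]])
A  = np.array([[1.0, 2*x[1]]])
H  = 2*np.eye(2) - y[0]*np.array([[0, 0], [0, 2.0]])   # H(x, y)

# full primal-dual system for the correction (s, v)
K   = np.block([[H, -A.T], [A, mu*np.eye(1)]])
rhs = -np.concatenate([g - A.T @ y, cx + mu*y])
sv  = np.linalg.solve(K, rhs)

# eliminated (condensed) system: (H(x,y) + A^T A/mu) s = -(g + A^T c/mu)
s = np.linalg.solve(H + A.T @ A / mu, -(g + A.T @ cx / mu))
print(np.allclose(sv[:2], s))   # same step: c.f. quadratic penalty Newton
```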
PRIMAL VS. PRIMAL-DUAL

Primal:

  (H(x, y(x)) + (1/µ) A^T(x) A(x)) s^P = -g(x, y(x))

Primal-dual:

  (H(x, y) + (1/µ) A^T(x) A(x)) s^PD = -g(x, y(x))

where y(x) = -c(x)/µ

What is the difference? Freedom to choose y in H(x, y) for primal-dual ... vital

ANOTHER EXAMPLE FOR EQUALITY CONSTRAINTS

minimize_{x ∈ IR^n} f(x) subject to c(x) = 0

Merit function (augmented Lagrangian function):

  Φ(x, u, µ) = f(x) - u^T c(x) + ||c(x)||_2^2 / (2µ)

where u and µ are auxiliary parameters

Two interpretations:
- shifted quadratic penalty function
- convexification of the Lagrangian function

Aim: adjust µ and u to encourage convergence
DERIVATIVES OF THE AUGMENTED LAGRANGIAN FUNCTION

  ∇_x Φ(x, u, µ) = g(x, y^F(x))
  ∇_xx Φ(x, u, µ) = H(x, y^F(x)) + (1/µ) A^T(x) A(x)

where
- first-order Lagrange multiplier estimates: y^F(x) := u - c(x)/µ
- g(x, y^F(x)) = g(x) - A^T(x) y^F(x): gradient of the Lagrangian
- H(x, y^F(x)) = H(x) - Σ_{i=1}^m y_i^F(x) H_i(x): Lagrangian Hessian

AUGMENTED LAGRANGIAN CONVERGENCE

Theorem 5.2. Suppose that f, c ∈ C^2, that y_k := u_k - c(x_k)/µ_k for given {u_k}, that

  ||∇_x Φ(x_k, u_k, µ_k)||_2 ≤ ε_k,

where ε_k converges to zero as k → ∞, and that x_k converges to x_* for which A(x_*) is full rank. Then {y_k} converge to some y_* for which g(x_*) = A^T(x_*) y_*. If additionally either µ_k converges to zero for bounded u_k, or u_k converges to y_* for bounded µ_k, then x_* and y_* satisfy the first-order necessary optimality conditions for the problem

  minimize_{x ∈ IR^n} f(x) subject to c(x) = 0
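The gradient identity ∇_x Φ(x, u, µ) = g(x) - A^T(x) y^F(x) can be checked against finite differences on the running example; the values of x, u, and µ below are arbitrary illustrative choices:

```python
import numpy as np

# Check the augmented Lagrangian gradient identity for the slides' example.
def Phi(x, u, mu):
    cx = x[0] + x[1]**2 - 1.0
    return x[0]**2 + x[1]**2 - u*cx + cx**2/(2*mu)

x, u, mu = np.array([0.3, 0.4]), 0.7, 0.05
cx = x[0] + x[1]**2 - 1.0
yF = u - cx/mu                                 # first-order estimate y^F(x)
A  = np.array([1.0, 2*x[1]])                   # row of the Jacobian
analytic = np.array([2*x[0], 2*x[1]]) - A*yF   # g(x) - A^T(x) y^F(x)

h = 1e-6                                       # central finite differences
fd = np.array([(Phi(x + h*e, u, mu) - Phi(x - h*e, u, mu)) / (2*h)
               for e in np.eye(2)])
print(np.allclose(analytic, fd, atol=1e-5))    # identity holds
```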
PROOF OF THEOREM 5.2

Convergence of y_k to y_* := A^+(x_*) g(x_*), for which g(x_*) = A^T(x_*) y_*, is exactly as for Theorem 5.1.

Definition of y_k ⇒ c(x_k) = µ_k (u_k - y_k), so

  ||c(x_k)|| ≤ µ_k ||y_k - y_*|| + µ_k ||u_k - y_*||

⇒ c(x_*) = 0 from the assumptions.
⇒ (x_*, y_*) satisfies the first-order optimality conditions.

CONTOURS OF THE AUGMENTED LAGRANGIAN FUNCTION

[two contour plots] Augmented Lagrangian function for min x_1^2 + x_2^2 subject to x_1 + x_2^2 = 1, with fixed µ = 1, for u = 0.5 (left) and u = 0.9 (right)
CONTOURS OF THE AUGMENTED LAGRANGIAN FUNCTION (cont.)

[two contour plots] Augmented Lagrangian function for min x_1^2 + x_2^2 subject to x_1 + x_2^2 = 1, with fixed µ = 1, for u = 0.99 (left) and u = y_* = 1 (right)

CONVERGENCE OF AUGMENTED LAGRANGIAN METHODS

- convergence guaranteed if u_k is fixed and µ_k → 0: y_k → y_* and c(x_k) → 0
- check if ||c(x_k)|| ≤ η_k, where {η_k} → 0
  - if so, set u_{k+1} = y_k and µ_{k+1} = µ_k
  - if not, set u_{k+1} = u_k and µ_{k+1} = τ µ_k for some τ ∈ (0, 1)
- reasonable choice: η_k = µ_k^{0.1 + 0.9 j}, where j = number of iterations since µ_k was last changed
- under such rules, can ensure that µ_k is eventually unchanged under modest assumptions, with (fast) linear convergence
- need also to ensure that 1/µ_k is sufficiently large that ∇_xx Φ(x_k, u_k, µ_k) is positive (semi-)definite
BASIC AUGMENTED LAGRANGIAN ALGORITHM

Given µ_0 > 0 and u_0, set k = 0
Until convergence iterate:
  Starting from x_k^S, use an unconstrained minimization algorithm to find an approximate minimizer x_k of Φ(x, u_k, µ_k) for which ||∇_x Φ(x_k, u_k, µ_k)|| ≤ ε_k
  If ||c(x_k)|| ≤ η_k, set u_{k+1} = y_k and µ_{k+1} = µ_k
  Otherwise set u_{k+1} = u_k and µ_{k+1} = τ µ_k
  Set suitable ε_{k+1} and η_{k+1}, and increase k by 1
- often choose τ = min(0.1, √µ_k)
- might choose x_{k+1}^S = x_k
- reasonable choice: ε_k = µ_k^{j+1}, where j = number of iterations since µ_k was last changed
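A minimal sketch of this algorithm on the slides' running example (solution (1/2, ±1/√2), multiplier y* = 1). The η_k, ε_k, and τ rules follow the slides; the inner solver (steepest descent with Armijo backtracking) and the safeguard floor on ε_k are illustrative choices of mine:

```python
import numpy as np

# Augmented Lagrangian algorithm for min x1^2 + x2^2 s.t. x1 + x2^2 = 1.
def c(x):  return x[0] + x[1]**2 - 1.0

def Phi(x, u, mu):
    return x[0]**2 + x[1]**2 - u*c(x) + c(x)**2/(2*mu)

def grad(x, u, mu):
    yF = u - c(x)/mu                       # first-order multiplier estimate
    return np.array([2*x[0] - yF, 2*x[1] - 2*x[1]*yF])

def inner(x, u, mu, tol):
    """Crude unconstrained minimization of Phi(., u, mu), started from x."""
    for _ in range(100000):
        d = -grad(x, u, mu)
        if np.linalg.norm(d) <= tol:
            break
        t = 1.0                            # Armijo backtracking linesearch
        while Phi(x + t*d, u, mu) > Phi(x, u, mu) - 1e-4*t*(d @ d) and t > 1e-12:
            t *= 0.5
        x = x + t*d
    return x

x, u, mu = np.array([0.0, 0.5]), 0.0, 0.1
j = 0                                      # iterations since mu last changed
for k in range(8):
    eps = max(mu**(j + 1), 1e-6)           # eps_k = mu_k^(j+1), floored
    eta = mu**(0.1 + 0.9*j)                # eta_k = mu_k^(0.1 + 0.9 j)
    x = inner(x, u, mu, eps)
    yk = u - c(x)/mu
    if abs(c(x)) <= eta:
        u, j = yk, j + 1                   # update multipliers, keep mu
    else:
        mu, j = min(0.1, np.sqrt(mu))*mu, 0   # reduce mu, keep u
print(x, u, mu)   # x -> (0.5, +-1/sqrt 2), u -> y* = 1
```

Unlike the quadratic penalty method, µ typically stays bounded away from zero here: once u is close to y*, the ||c(x_k)|| ≤ η_k test keeps succeeding and only the multipliers are updated.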