Lectures 11 and 12: Penalty methods and augmented Lagrangian methods for nonlinear programming
Coralia Cartis, Mathematical Institute, University of Oxford
C6.2/B2: Continuous Optimization
Penalty methods for nonlinear programming
Nonlinear equality-constrained problems

  $\min_{x \in \mathbb{R}^n} f(x)$ subject to $c(x) = 0$,   (ecp)

where $f:\mathbb{R}^n \to \mathbb{R}$ and $c = (c_1,\dots,c_m):\mathbb{R}^n \to \mathbb{R}^m$ are smooth.

We attempt to find local solutions (at least KKT points).

Constrained optimization involves a conflict of requirements: objective minimization versus feasibility of the solution. It is easier to generate feasible iterates for linear equality and general inequality constrained problems; very hard, even impossible in general, when general equality constraints are present.

$\Longrightarrow$ form a single, parametrized, unconstrained objective whose minimizers approach the solutions of the original problem as the parameters vary.
A penalty function for (ecp)

  $\min_{x \in \mathbb{R}^n} f(x)$ subject to $c(x) = 0$.   (ecp)

The quadratic penalty function:

  $\min_{x \in \mathbb{R}^n} \Phi_\sigma(x) = f(x) + \frac{1}{2\sigma}\|c(x)\|^2$,   (ecp$_\sigma$)

where $\sigma > 0$ is the penalty parameter.

$1/\sigma$ weights the penalty on infeasibility; letting $\sigma \to 0$ forces the constraints to be satisfied while achieving optimality for $f$.

$\Phi_\sigma$ may have other stationary points that are not solutions of (ecp); e.g., when $c(x) = 0$ is inconsistent.
Contours of the penalty function $\Phi_\sigma$ - an example

[Contour plots of the quadratic penalty function for $\min x_1^2 + x_2^2$ subject to $x_1 + x_2^2 = 1$, for $\sigma = 100$ (left) and $\sigma = 1$ (right).]
Contours of the penalty function $\Phi_\sigma$ - an example...

[Contour plots of the quadratic penalty function for $\min x_1^2 + x_2^2$ subject to $x_1 + x_2^2 = 1$, for $\sigma = 0.1$ (left) and $\sigma = 0.01$ (right).]
A quadratic penalty method

Given $\sigma_0 > 0$, let $k = 0$. Until convergence, do:
  Choose $0 < \sigma_{k+1} < \sigma_k$.
  Starting from $x_0^k$ (possibly, $x_0^k := x^k$), use an unconstrained minimization algorithm to find an approximate minimizer $x^{k+1}$ of $\Phi_{\sigma_{k+1}}$.
  Let $k := k + 1$.

Must have $\sigma_k \to 0$ as $k \to \infty$; e.g., $\sigma_{k+1} := 0.1\sigma_k$, $\sigma_{k+1} := (\sigma_k)^2$, etc.

Algorithms for minimizing $\Phi_\sigma$: linesearch or trust-region methods. For small $\sigma$, $\Phi_\sigma$ is very steep in the direction of the constraint gradients, and so $\Phi_\sigma$ changes rapidly for steps in such directions; this has implications for the shape of the trust region.
A convergence result for the penalty method

Theorem 25 (Global convergence of the penalty method). Apply the basic quadratic penalty method to (ecp). Assume that $f, c \in C^1$, let $y_i^k = -c_i(x^k)/\sigma_k$, $i = 1,\dots,m$, and $\|\nabla\Phi_{\sigma_k}(x^k)\| \le \epsilon_k$, where $\epsilon_k \to 0$ as $k \to \infty$, and also $\sigma_k \to 0$ as $k \to \infty$. Moreover, assume that $x^k \to x^*$, where $\nabla c_i(x^*)$, $i = 1,\dots,m$, are linearly independent. Then $x^*$ is a KKT point of (ecp) and $y^k \to y^*$, where $y^*$ is the vector of Lagrange multipliers of the (ecp) constraints.

$\nabla c_i(x^*)$, $i = 1,\dots,m$, linearly independent $\iff$ the Jacobian matrix $J(x^*)$ of the constraints has full row rank, and so $m \le n$.

If $J(x^*)$ is not full rank, then $x^*$ (locally) minimizes the infeasibility $\|c(x)\|$. [cf. (*) on the next slide]
A convergence result for the penalty method

Proof of Theorem 25. $J(x^*)$ full rank $\Longrightarrow$ $J(x^*)^+ = (J(x^*)J(x^*)^T)^{-1}J(x^*)$ is the pseudo-inverse. As $x^k \to x^*$ and $J$ is continuous, $\|J(x^k)^+\|$ is bounded above for all sufficiently large $k$. Let $y^k = -c(x^k)/\sigma_k$ and $y^* = J(x^*)^+\nabla f(x^*)$.

  $\|\nabla\Phi_{\sigma_k}(x^k)\| = \|\nabla f(x^k) - J(x^k)^T y^k\| \le \epsilon_k$   (*)

  $\|J(x^k)^+\nabla f(x^k) - y^k\| = \|J(x^k)^+(\nabla f(x^k) - J(x^k)^T y^k)\| \le \|J(x^k)^+\|\,\|\nabla f(x^k) - J(x^k)^T y^k\| \le \{\|J(x^k)^+ - J(x^*)^+\| + \|J(x^*)^+\|\}\,\epsilon_k \le 2\|J(x^*)^+\|\,\epsilon_k$   (**)

where in the last inequality we used $x^k \to x^*$ and the continuity of $J^+$. The triangle inequality (adding and subtracting $J(x^k)^+\nabla f(x^k)$) and the definition of $y^*$ give

  $\|y^k - y^*\| \le \|J(x^k)^+\nabla f(x^k) - J(x^*)^+\nabla f(x^*)\| + \|J(x^k)^+\nabla f(x^k) - y^k\|$.

Thus $y^k \to y^*$, since $x^k \to x^*$, $J^+$ and $\nabla f$ are continuous, (**) holds and $\epsilon_k \to 0$. Using all these again in (*) as $k \to \infty$: $\nabla f(x^*) - J(x^*)^T y^* = 0$. As $c(x^k) = -\sigma_k y^k$ with $\sigma_k \to 0$ and $y^k \to y^*$, we get $c(x^*) = 0$. Thus $x^*$ is a KKT point. $\Box$
Derivatives of the penalty function

Let $y(\sigma) := -c(x)/\sigma$: estimates of the Lagrange multipliers. Let $L$ be the Lagrangian function of (ecp),

  $L(x,y) := f(x) - y^T c(x)$,   and recall   $\Phi_\sigma(x) = f(x) + \frac{1}{2\sigma}\|c(x)\|^2$.

Then
  $\nabla\Phi_\sigma(x) = \nabla f(x) + \frac{1}{\sigma}J(x)^T c(x) = \nabla_x L(x, y(\sigma))$,
where $J(x)$ is the $m \times n$ Jacobian matrix of the constraints $c(x)$.

  $\nabla^2\Phi_\sigma(x) = \nabla^2 f(x) + \frac{1}{\sigma}\sum_{i=1}^m c_i(x)\nabla^2 c_i(x) + \frac{1}{\sigma}J(x)^T J(x) = \nabla^2_{xx}L(x,y(\sigma)) + \frac{1}{\sigma}J(x)^T J(x)$.

$\sigma \to 0$: generally, $c_i(x) \to 0$ at the same rate as $\sigma$ for all $i$; thus usually $\nabla^2_{xx}L(x,y(\sigma))$ is well behaved.

$\sigma \to 0$: $J(x)^T J(x)/\sigma \to J(x^*)^T J(x^*)/0 = \infty$.
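A quick numerical sanity check (not from the slides) of the identity $\nabla\Phi_\sigma(x) = \nabla_x L(x,y(\sigma))$ on the contour-plot example, with all derivatives written out by hand:

```python
import numpy as np

# Example from the contour plots:  f = x1^2 + x2^2,  c(x) = x1 + x2^2 - 1.
f = lambda x: x[0]**2 + x[1]**2
c = lambda x: np.array([x[0] + x[1]**2 - 1.0])
J = lambda x: np.array([[1.0, 2.0*x[1]]])        # 1 x 2 Jacobian of c
grad_f = lambda x: np.array([2.0*x[0], 2.0*x[1]])

def grad_phi(x, sigma):
    # grad Phi_sigma(x) = grad f(x) + J(x)^T c(x) / sigma
    return grad_f(x) + J(x).T @ c(x) / sigma

def grad_L(x, y):
    # grad_x L(x,y) = grad f(x) - J(x)^T y
    return grad_f(x) - J(x).T @ y

x, sigma = np.array([0.3, 0.8]), 0.05
y = -c(x) / sigma                                 # multiplier estimate y(sigma)
assert np.allclose(grad_phi(x, sigma), grad_L(x, y))
```

The identity is exact algebraically, since substituting $y(\sigma) = -c(x)/\sigma$ into $\nabla f - J^T y$ reproduces $\nabla f + \frac{1}{\sigma}J^T c$ term by term.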
Ill-conditioning of the penalty's Hessian...

Fact [cf. Th 5.2, Gould ref.]: $m$ eigenvalues of $\nabla^2\Phi_{\sigma_k}(x^k)$ are $O(1/\sigma_k)$ and hence tend to infinity as $k \to \infty$ (i.e., $\sigma_k \to 0$); the remaining $n - m$ are $O(1)$ in the limit.

Hence the condition number of $\nabla^2\Phi_{\sigma_k}(x^k)$ is $O(1/\sigma_k)$ $\Longrightarrow$ it blows up as $k \to \infty$.

$\Longrightarrow$ worried that we may not be able to compute changes to $x^k$ accurately. Namely, whether using linesearch or trust-region methods, asymptotically we want to minimize $\Phi_{\sigma_{k+1}}(x)$ by taking Newton steps, i.e., solve the system

  $\nabla^2\Phi_\sigma(x)\,dx = -\nabla\Phi_\sigma(x)$   (*)

for $dx$, from some current $x = x^{k,i}$ and $\sigma = \sigma_{k+1}$.

Despite the ill-conditioning present, we can still solve for $dx$ accurately!
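The eigenvalue split can be seen numerically on the contour-plot example (an illustration, not the cited Th 5.2 itself). At the constrained minimizer, $c(x^*) = 0$, so $\nabla^2\Phi_\sigma(x^*) = \nabla^2 f(x^*) + \frac{1}{\sigma}J(x^*)^T J(x^*)$: one eigenvalue is $O(1/\sigma)$, the other $O(1)$.

```python
import numpy as np

# Hessian of Phi_sigma for  f = x1^2 + x2^2,  c(x) = x1 + x2^2 - 1:
#   hess f = 2I,  hess c = [[0,0],[0,2]],  J(x) = [1, 2 x2].
def hess_phi(x, sigma):
    cval = x[0] + x[1]**2 - 1.0
    J = np.array([[1.0, 2.0*x[1]]])
    hess_f = 2.0 * np.eye(2)
    hess_c = np.array([[0.0, 0.0], [0.0, 2.0]])
    return hess_f + (cval * hess_c + J.T @ J) / sigma

x_star = np.array([0.5, np.sqrt(0.5)])   # the constrained minimizer
for sigma in [1e-2, 1e-4, 1e-6]:
    lam = np.linalg.eigvalsh(hess_phi(x_star, sigma))
    print(f"sigma={sigma:.0e}  eigenvalues={lam}  cond={lam[-1]/lam[0]:.1e}")
```

Here $J(x^*) = (1, \sqrt{2})$, so $J^TJ$ has eigenvalues $0$ and $3$; the Hessian eigenvalues are exactly $2$ and $2 + 3/\sigma$, and the condition number grows like $1.5/\sigma$.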
Solving accurately for the Newton direction

Using the computed formulas for the derivatives, (*) is equivalent to

  $\left(\nabla^2_{xx}L(x,y(\sigma)) + \frac{1}{\sigma}J(x)^T J(x)\right)dx = -\left(\nabla f(x) + \frac{1}{\sigma}J(x)^T c(x)\right)$,

where $y(\sigma) = -c(x)/\sigma$. Define the auxiliary variable $w$,

  $w = \frac{1}{\sigma}\left(J(x)dx + c(x)\right)$.

Then the Newton system (*) can be rewritten as

  $\begin{pmatrix} \nabla^2_{xx}L(x,y(\sigma)) & J(x)^T \\ J(x) & -\sigma I \end{pmatrix}\begin{pmatrix} dx \\ w \end{pmatrix} = -\begin{pmatrix} \nabla f(x) \\ c(x) \end{pmatrix}$

This system is essentially independent of $\sigma$ for small $\sigma$ $\Longrightarrow$ it cannot suffer from ill-conditioning due to $\sigma \to 0$.

Still need to be careful about minimizing $\Phi_\sigma$ for small $\sigma$. E.g., when using trust-region methods, use $\|dx\|_B$ for the trust-region constraint, where $B$ takes into account the ill-conditioned terms of the Hessian so as to encourage equal model decrease in all directions.
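A sketch verifying the equivalence of the two systems on small made-up data (the matrices below are hypothetical stand-ins, not derived from any particular problem): the condensed form contains $J^TJ/\sigma$ and becomes ill-conditioned as $\sigma \to 0$, whereas the augmented form keeps $\sigma$ only in the benign $-\sigma I$ block.

```python
import numpy as np

n, m, sigma = 2, 1, 1e-8
hess_L = np.array([[2.0, 0.3], [0.3, 1.5]])   # stand-in for hess_xx L(x, y(sigma))
J = np.array([[1.0, 2.0]])                    # stand-in constraint Jacobian
g = np.array([0.7, -0.4])                     # stand-in for grad f(x)
cval = np.array([0.05])                       # stand-in for c(x)

# Condensed Newton system: (hess_L + J^T J/sigma) dx = -(g + J^T c/sigma).
dx_condensed = np.linalg.solve(hess_L + J.T @ J / sigma,
                               -(g + J.T @ cval / sigma))

# Equivalent augmented system in (dx, w), with w = (J dx + c)/sigma:
#   [ hess_L  J^T ] [dx]     [ g ]
#   [   J   -s I  ] [ w] = - [ c ]
K = np.block([[hess_L, J.T], [J, -sigma * np.eye(m)]])
sol = np.linalg.solve(K, -np.concatenate([g, cval]))
dx_augmented = sol[:n]

assert np.allclose(dx_condensed, dx_augmented, atol=1e-6)
```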
Perturbed optimality conditions

  $\min_{x \in \mathbb{R}^n} f(x)$ subject to $c(x) = 0$.   (ecp)

A solution of (ecp) satisfies the KKT conditions

  (dual feasibility) $\nabla f(x) = J(x)^T y$   and   (primal feasibility) $c(x) = 0$.

Consider the perturbed system

  $\nabla f(x) - J(x)^T y = 0$
  $c(x) + \sigma y = 0$   (ecp$_p$)

Find roots of the nonlinear system (ecp$_p$) as $\sigma \to 0$ ($\sigma > 0$); use Newton's method for root finding.
Perturbed optimality conditions...

Newton's method for the system (ecp$_p$) computes the change $(dx, dy)$ to $(x, y)$ from

  $\begin{pmatrix} \nabla^2_{xx}L(x,y) & -J(x)^T \\ J(x) & \sigma I \end{pmatrix}\begin{pmatrix} dx \\ dy \end{pmatrix} = -\begin{pmatrix} \nabla f(x) - J(x)^T y \\ c(x) + \sigma y \end{pmatrix}$

Eliminating $dy$ gives

  $\left(\nabla^2_{xx}L(x,y) + \frac{1}{\sigma}J(x)^T J(x)\right)dx = -\left(\nabla f(x) + \frac{1}{\sigma}J(x)^T c(x)\right)$

$\Longrightarrow$ the same as Newton's method for the quadratic penalty! So what's different?
Perturbed optimality conditions...

Primal:

  $\left(\nabla^2_{xx}L(x,y(\sigma)) + \frac{1}{\sigma}J(x)^T J(x)\right)dx_p = -\left(\nabla f(x) + \frac{1}{\sigma}J(x)^T c(x)\right)$,

where $y(\sigma) = -c(x)/\sigma$.

Primal-dual:

  $\left(\nabla^2_{xx}L(x,y) + \frac{1}{\sigma}J(x)^T J(x)\right)dx_{pd} = -\left(\nabla f(x) + \frac{1}{\sigma}J(x)^T c(x)\right)$

The difference is the freedom to choose $y$ in $\nabla^2_{xx}L(x,y)$ in primal-dual methods - it makes a big difference computationally.
Other penalty functions

Consider the general (CP) problem:

  $\min_{x \in \mathbb{R}^n} f(x)$ subject to $c_E(x) = 0$, $c_I(x) \ge 0$.   (CP)

Exact penalty function: $\Phi(x,\sigma)$ is exact if there is $\bar\sigma > 0$ such that for $\sigma < \bar\sigma$, any local solution of (CP) is a local minimizer of $\Phi(x,\sigma)$. (The quadratic penalty is inexact.)

Examples:
  $\ell_2$-penalty function: $\Phi(x,\sigma) = f(x) + \frac{1}{\sigma}\|c_E(x)\|$
  $\ell_1$-penalty function: letting $[z]^- = \min\{z, 0\}$,
    $\Phi(x,\sigma) = f(x) + \frac{1}{\sigma}\sum_{i \in E}|c_i(x)| + \frac{1}{\sigma}\sum_{i \in I}\left|[c_i(x)]^-\right|$

Extension of the quadratic penalty to (CP):
  $\Phi(x,\sigma) = f(x) + \frac{1}{2\sigma}\|c_E(x)\|^2 + \frac{1}{2\sigma}\sum_{i \in I}\left([c_i(x)]^-\right)^2$
(may no longer be sufficiently smooth; it is inexact)
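A one-dimensional illustration of exactness (my own example, not from the slides): for $\min x$ subject to $x \ge 1$, with solution $x^* = 1$ and multiplier $y^* = 1$, the $\ell_1$ penalty has its minimizer exactly at $x^* = 1$ once $1/\sigma > y^*$, while the quadratic penalty minimizer $1 - \sigma$ only reaches $x^*$ in the limit $\sigma \to 0$.

```python
import numpy as np

# min x  subject to  x >= 1,  i.e. c_I(x) = x - 1 >= 0.
xs = np.linspace(0.0, 2.0, 200001)
sigma = 0.5                                    # 1/sigma = 2 > y* = 1

# l1 penalty: f(x) + |[c(x)]^-| / sigma, with [z]^- = min{z, 0}.
l1_pen = xs + np.maximum(1.0 - xs, 0.0) / sigma
# Quadratic penalty: f(x) + ([c(x)]^-)^2 / (2 sigma).
quad_pen = xs + np.minimum(xs - 1.0, 0.0)**2 / (2.0 * sigma)

x_l1 = xs[np.argmin(l1_pen)]       # exact: sits at the constrained solution
x_quad = xs[np.argmin(quad_pen)]   # inexact: pulled into infeasibility by sigma
```

On this grid, `x_l1` lands at $1$ while `x_quad` lands at $1 - \sigma = 0.5$, which is the (ecp$_\sigma$)-style trade-off between optimality and feasibility in miniature.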
Augmented Lagrangian methods for nonlinear programming
Nonlinear equality-constrained problems

  $\min_{x \in \mathbb{R}^n} f(x)$ subject to $c(x) = 0$,   (ecp)

where $f:\mathbb{R}^n \to \mathbb{R}$ and $c = (c_1,\dots,c_m):\mathbb{R}^n \to \mathbb{R}^m$ are smooth.

Another example of a merit function and method for (ecp): the augmented Lagrangian function

  $\Phi(x,u,\sigma) = f(x) - u^T c(x) + \frac{1}{2\sigma}\|c(x)\|^2$,

where $u \in \mathbb{R}^m$ and $\sigma > 0$ are auxiliary parameters.

Two interpretations:
  a shifted quadratic penalty function
  a convexification of the Lagrangian function

Aim: adjust $u$ and $\sigma$ to encourage convergence.
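The "shifted quadratic penalty" interpretation can be checked numerically: completing the square gives, up to a constant in $x$, $\Phi(x,u,\sigma) = f(x) + \frac{1}{2\sigma}\|c(x) - \sigma u\|^2 - \frac{\sigma}{2}\|u\|^2$, so the constraint is penalized about the shifted target $\sigma u$ rather than $0$. A quick check on the contour-plot example (the specific evaluation point is arbitrary):

```python
import numpy as np

f = lambda x: x[0]**2 + x[1]**2
c = lambda x: np.array([x[0] + x[1]**2 - 1.0])

def aug_lag(x, u, sigma):
    # Phi(x,u,sigma) = f(x) - u^T c(x) + ||c(x)||^2 / (2 sigma)
    return f(x) - u @ c(x) + c(x) @ c(x) / (2.0 * sigma)

def shifted_penalty(x, u, sigma):
    # f(x) + ||c(x) - sigma u||^2 / (2 sigma) - (sigma/2) ||u||^2
    r = c(x) - sigma * u
    return f(x) + r @ r / (2.0 * sigma) - sigma * (u @ u) / 2.0

x, u, sigma = np.array([0.2, 0.7]), np.array([0.9]), 0.3
assert np.isclose(aug_lag(x, u, sigma), shifted_penalty(x, u, sigma))
```

Since the $-\frac{\sigma}{2}\|u\|^2$ term does not depend on $x$, both forms have the same minimizers in $x$.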
Derivatives of the augmented Lagrangian function

Let $J(x)$ be the Jacobian of the constraints $c(x) = (c_1(x),\dots,c_m(x))$, and recall the Lagrangian $L(x,y) = f(x) - y^T c(x)$.

  $\nabla_x\Phi(x,u,\sigma) = \nabla f(x) - J(x)^T u + \frac{1}{\sigma}J(x)^T c(x)$
  $\Longrightarrow \nabla_x\Phi(x,u,\sigma) = \nabla f(x) - J(x)^T y(x) = \nabla_x L(x, y(x))$,

where $y(x) = u - \dfrac{c(x)}{\sigma}$ are the Lagrange multiplier estimates.

  $\nabla^2_{xx}\Phi(x,u,\sigma) = \nabla^2 f(x) - \sum_{i=1}^m u_i\nabla^2 c_i(x) + \frac{1}{\sigma}\sum_{i=1}^m c_i(x)\nabla^2 c_i(x) + \frac{1}{\sigma}J(x)^T J(x)$
  $\Longrightarrow \nabla^2_{xx}\Phi(x,u,\sigma) = \nabla^2 f(x) - \sum_{i=1}^m y_i(x)\nabla^2 c_i(x) + \frac{1}{\sigma}J(x)^T J(x)$
  $\Longrightarrow \nabla^2_{xx}\Phi(x,u,\sigma) = \nabla^2_{xx}L(x, y(x)) + \frac{1}{\sigma}J(x)^T J(x)$
A convergence result for the augmented Lagrangian

Theorem 26 (Global convergence of the augmented Lagrangian method). Assume that $f, c \in C^1$ in (ecp), let $y^k = u^k - \dfrac{c(x^k)}{\sigma_k}$ for given $u^k \in \mathbb{R}^m$, and assume that $\|\nabla_x\Phi(x^k,u^k,\sigma_k)\| \le \epsilon_k$, where $\epsilon_k \to 0$ as $k \to \infty$. Moreover, assume that $x^k \to x^*$, where $\nabla c_i(x^*)$, $i = 1,\dots,m$, are linearly independent.

Then $y^k \to y^*$ as $k \to \infty$, with $y^*$ satisfying $\nabla f(x^*) - J(x^*)^T y^* = 0$.

If additionally either $\sigma_k \to 0$ with bounded $u^k$, or $u^k \to y^*$ with bounded $\sigma_k$, then $x^*$ is a KKT point of (ecp) with associated Lagrange multipliers $y^*$.
A convergence result for the augmented Lagrangian

Proof of Theorem 26. The first part of Th 26, namely the convergence of $y^k$ to $y^* = J(x^*)^+\nabla f(x^*)$, follows exactly as in the proof of Theorem 25 (penalty method convergence). (Note that the assumption $\sigma_k \to 0$ is not needed for this part of the proof of Th 25.)

It remains to show that, under the additional assumptions on $u^k$ and $\sigma_k$, $x^*$ is feasible for the constraints. To see this, use the definition of $y^k$ to deduce $c(x^k) = \sigma_k(u^k - y^k)$, and so

  $\|c(x^k)\| = \sigma_k\|u^k - y^k\| \le \sigma_k\|y^k - y^*\| + \sigma_k\|u^k - y^*\|$.

Thus $\|c(x^k)\| \to 0$ as $k \to \infty$, due to $y^k \to y^*$ (cf. first part of the theorem) and the additional assumptions on $u^k$ and $\sigma_k$. As $x^k \to x^*$ and $c$ is continuous, we deduce that $c(x^*) = 0$. $\Box$

Note that the augmented Lagrangian method may converge to KKT points without $\sigma_k \to 0$, which limits the ill-conditioning.
Contours of the augmented Lagrangian - an example

[Contour plots of the augmented Lagrangian function for $\min x_1^2 + x_2^2$ subject to $x_1 + x_2^2 = 1$, for fixed $\sigma = 1$, with $u = 0.5$ (left) and $u = 0.9$ (right).]
Contours of the augmented Lagrangian - an example...

[Contour plots of the augmented Lagrangian function for $\min x_1^2 + x_2^2$ subject to $x_1 + x_2^2 = 1$, for fixed $\sigma = 1$, with $u = 0.99$ (left) and $u = y^* = 1$ (right).]
Augmented Lagrangian methods

Th 26 $\Longrightarrow$ convergence guaranteed if $u^k$ is fixed and $\sigma_k \to 0$ [similar to quadratic penalty methods] $\Longrightarrow$ $y^k \to y^*$ and $c(x^k) \to 0$.

  Check if $\|c(x^k)\| \le \eta_k$, where $\eta_k \to 0$.
  If so, set $u^{k+1} = y^k$ and $\sigma_{k+1} = \sigma_k$ [recall the expression of $y^k$ in Th 26].
  If not, set $u^{k+1} = u^k$ and $\sigma_{k+1} = \tau\sigma_k$ for some $\tau \in (0,1)$.

Reasonable choice: $\eta_k = (\sigma_k)^{0.1 + 0.9j}$, where $j$ is the number of iterations since $\sigma_k$ last changed.

Under such rules, one can ensure that $\sigma_k$ is eventually unchanged under modest assumptions, with (fast) linear convergence.

Also need to ensure that $\sigma_k$ is sufficiently small that the Hessian $\nabla^2_{xx}\Phi(x^k,u^k,\sigma_k)$ is positive (semi-)definite.
A basic augmented Lagrangian method

Given $\sigma_0 > 0$ and $u^0$, let $k = 0$. Until convergence, do:
  Set $\eta_k$ and $\epsilon_{k+1}$.
  If $\|c(x^k)\| \le \eta_k$, set $u^{k+1} = y^k$ and $\sigma_{k+1} = \sigma_k$. Otherwise, set $u^{k+1} = u^k$ and $\sigma_{k+1} = \tau\sigma_k$.
  Starting from $x_0^k$ (possibly, $x_0^k := x^k$), use an unconstrained minimization algorithm to find an approximate minimizer $x^{k+1}$ of $\Phi(\cdot,u^{k+1},\sigma_{k+1})$ for which $\|\nabla_x\Phi(x^{k+1},u^{k+1},\sigma_{k+1})\| \le \epsilon_{k+1}$.
  Let $k := k + 1$.

Often choose $\tau = \min(0.1, \sqrt{\sigma_k})$.
Reasonable choice: $\epsilon_k = (\sigma_k)^{j+1}$, where $j$ is the number of iterations since $\sigma_k$ last changed.
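A minimal sketch of the method above on the contour-plot example $\min x_1^2 + x_2^2$ s.t. $x_1 + x_2^2 = 1$ (solution $x^* = (1/2, 1/\sqrt{2})$, $y^* = 1$). The tolerances follow the $\eta_k$, $\epsilon_k$ rules from these slides; using scipy's BFGS as the inner solver, the fixed constant $\tau$, and the specific starting values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + x[1]**2
c = lambda x: np.array([x[0] + x[1]**2 - 1.0])

def augmented_lagrangian_method(x0, u0=0.0, sigma0=0.1, tau=0.1, n_outer=15):
    x, u, sigma = np.asarray(x0, float), np.array([u0]), sigma0
    j = 0                                    # iterations since sigma last changed
    for _ in range(n_outer):
        eta = sigma**(0.1 + 0.9 * j)         # feasibility tolerance eta_k
        eps = max(sigma**(j + 1), 1e-10)     # inner stopping tolerance eps_k
        # Inner solve: approximately minimize Phi(., u, sigma).
        phi = lambda z: f(z) - u @ c(z) + c(z) @ c(z) / (2.0 * sigma)
        x = minimize(phi, x, method="BFGS", options={"gtol": eps}).x
        y = u - c(x) / sigma                 # multiplier estimate y^k
        if np.linalg.norm(c(x)) <= eta:
            u, j = y, j + 1                  # good progress: update u, keep sigma
        else:
            sigma, j = tau * sigma, 0        # insufficient: shrink sigma
    return x, u

x, u = augmented_lagrangian_method([0.0, 0.5])
```

In contrast to the quadratic penalty run, the multiplier updates typically let $\sigma_k$ settle at a moderate value while $u^k \to y^*$, so the inner Hessians stay well conditioned.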