4TE3/6TE3 Algorithms for Continuous Optimization (Duality in Nonlinear Optimization) — Tamás Terlaky, Computing and Software, McMaster University, Hamilton, January 2004. terlaky@mcmaster.ca, Tel: 27780
Optimality conditions for constrained convex optimization problems

  (CO)  min f(x)
        s.t. g_j(x) ≤ 0, j = 1, …, m,
             x ∈ C.

The feasible set is F = {x ∈ C : g_j(x) ≤ 0, j ∈ J}, where J = {1, …, m}. Let C⁰ denote the relative interior of the convex set C.

Definition 1. A vector (point) x⁰ ∈ C⁰ is called a Slater point of (CO) if
  g_j(x⁰) < 0  for all j where g_j is nonlinear,
  g_j(x⁰) ≤ 0  for all j where g_j is linear.
If such a point exists, (CO) is Slater regular, or (CO) satisfies the Slater condition (in other words, (CO) satisfies the Slater constraint qualification).
Ideal Slater Point

Some constraint functions g_j might take the value zero at every feasible point. Such constraints are called singular, while the others are called regular:

  J_s = {j ∈ J : g_j(x) = 0 for all x ∈ F},
  J_r = J \ J_s = {j ∈ J : g_j(x) < 0 for some x ∈ F}.

Remark: Note that if (CO) is Slater regular, then all singular constraint functions must be linear.

Definition 2. A vector (point) x* ∈ C⁰ is called an ideal Slater point of the convex optimization problem (CO) if x* is a Slater point and
  g_j(x*) < 0  for all j ∈ J_r,
  g_j(x*) = 0  for all j ∈ J_s.

Lemma 1. If the problem (CO) is Slater regular, then there exists an ideal Slater point x* ∈ F.
Convex Farkas Lemma

Theorem 1 (Separation). Let U ⊆ IR^n be a convex set and let a point w ∈ IR^n with w ∉ U be given. Then there is a separating hyperplane, given by a ∈ IR^n and α ∈ IR, such that a^T w ≤ α ≤ a^T u for all u ∈ U, but U is not a subset of the hyperplane {x : a^T x = α}. Note that the last property says that there is a u ∈ U such that a^T u > α.

Lemma 2 (Farkas). Let the convex optimization problem (CO) satisfy the Slater regularity condition. The inequality system

  f(x) < 0,  g_j(x) ≤ 0, j = 1, …, m,  x ∈ C   (1)

has no solution if and only if there exists a vector y = (y_1, …, y_m) ≥ 0 such that

  f(x) + Σ_{j=1}^m y_j g_j(x) ≥ 0  for all x ∈ C.   (2)

The systems (1) and (2) are called alternative systems, i.e. exactly one of them has a solution.
Proof of the Convex Farkas Lemma

If (1) is solvable, then (2) cannot hold. To prove the other direction, assume that (1) has no solution. With u = (u_0, …, u_m), we define the set U ⊆ IR^{m+1} by

  U = {u : ∃ x ∈ C with u_0 > f(x), u_j ≥ g_j(x) if j ∈ J_r, u_j = g_j(x) if j ∈ J_s}.

U is convex (note that, due to the Slater condition, the singular functions are linear), and due to the infeasibility of (1) it does not contain the origin. By the separation Theorem 1 there exists a separating hyperplane, defined by (y_0, y_1, …, y_m) and α = 0, such that

  Σ_{j=0}^m y_j u_j ≥ 0  for all u ∈ U,   (3)

and for some ū ∈ U one has

  Σ_{j=0}^m y_j ū_j > 0.   (4)

The rest of the proof is divided into four parts.
I. First we prove that y_0 ≥ 0 and y_j ≥ 0 for all j ∈ J_r.
II. Secondly we establish that (3) holds for u = (f(x), g_1(x), …, g_m(x)) whenever x ∈ C.
III. Then we prove that y_0 must be positive.
IV. Finally, it is shown by induction that y_j > 0 can be reached for all j ∈ J_s.
Proof: Steps I and II

I. First we show that y_0 ≥ 0 and y_j ≥ 0 for all j ∈ J_r. Assume to the contrary that y_0 < 0, and take an arbitrary (u_0, u_1, …, u_m) ∈ U. By the definition of U, (u_0 + λ, u_1, …, u_m) ∈ U for all λ ≥ 0. Hence by (3) one has

  λ y_0 + Σ_{j=0}^m y_j u_j ≥ 0  for all λ ≥ 0.

For sufficiently large λ the left-hand side is negative, which is a contradiction, i.e. y_0 must be nonnegative. The proof of the nonnegativity of y_j for j ∈ J_r goes analogously.

II. Secondly we establish that

  y_0 f(x) + Σ_{j=1}^m y_j g_j(x) ≥ 0  for all x ∈ C.   (5)

This follows from the observation that for all x ∈ C and for all λ > 0 one has u = (f(x) + λ, g_1(x), …, g_m(x)) ∈ U, thus

  y_0 (f(x) + λ) + Σ_{j=1}^m y_j g_j(x) ≥ 0  for all x ∈ C.

Taking the limit as λ → 0, the claim follows.
Proof: Step III

III. Thirdly we show that y_0 > 0. The proof is by contradiction. We already know that y_0 ≥ 0; assume to the contrary that y_0 = 0. Hence from (5) we have

  Σ_{j∈J_r} y_j g_j(x) + Σ_{j∈J_s} y_j g_j(x) = Σ_{j=1}^m y_j g_j(x) ≥ 0  for all x ∈ C.

Taking an ideal Slater point x* ∈ C⁰, one has g_j(x*) = 0 for j ∈ J_s, whence

  Σ_{j∈J_r} y_j g_j(x*) ≥ 0.

Since y_j ≥ 0 and g_j(x*) < 0 for all j ∈ J_r, this implies y_j = 0 for all j ∈ J_r. This results in

  Σ_{j∈J_s} y_j g_j(x) ≥ 0  for all x ∈ C.   (6)

Now, from (4), with x̄ ∈ C such that ū_j = g_j(x̄) for j ∈ J_s, we have

  Σ_{j∈J_s} y_j g_j(x̄) > 0.   (7)

Because the ideal Slater point x* is in the relative interior of C, there exist a vector x̂ ∈ C and 0 < λ < 1 such that x* = λ x̄ + (1 − λ) x̂. Using that g_j(x*) = 0 for j ∈ J_s and that the singular functions are linear, one gets
Proof: Step III cntd.

  0 = Σ_{j∈J_s} y_j g_j(x*) = Σ_{j∈J_s} y_j g_j(λ x̄ + (1 − λ) x̂)
    = λ Σ_{j∈J_s} y_j g_j(x̄) + (1 − λ) Σ_{j∈J_s} y_j g_j(x̂)
    > (1 − λ) Σ_{j∈J_s} y_j g_j(x̂).

Here the last inequality follows from (7). The resulting inequality (1 − λ) Σ_{j∈J_s} y_j g_j(x̂) < 0 contradicts (6). Hence we have proved that y_0 > 0.

At this point we have (5) with y_0 > 0 and y_j ≥ 0 for all j ∈ J_r. Dividing (5) by y_0 and redefining y_j := y_j / y_0 for all j ∈ J, we obtain

  f(x) + Σ_{j=1}^m y_j g_j(x) ≥ 0  for all x ∈ C.   (8)

We finally show that y may be taken such that y_j > 0 for all j ∈ J_s.
Proof: Step IV

IV. To complete the proof we show, by induction on the cardinality of J_s, that the multipliers y_j can be made strictly positive for all j ∈ J_s. Observe that if J_s = ∅ then we are done. If |J_s| = 1, say J_s = {s}, then we apply the results proved so far to the inequality system

  g_s(x) < 0,  g_j(x) ≤ 0, j ∈ J_r,  x ∈ C.   (9)

The system (9) has no solution (g_s is singular), and it satisfies the Slater condition; therefore there exist multipliers ŷ_j ≥ 0, j ∈ J_r, such that

  g_s(x) + Σ_{j∈J_r} ŷ_j g_j(x) ≥ 0  for all x ∈ C,   (10)

where the coefficient of the singular g_s is positive. Adding a sufficiently large positive multiple of (10) to (8), one obtains a positive coefficient for g_s(x). The general inductive step goes analogously.
Proof: Step IV cntd.

Assuming that the result holds when |J_s| = k, we prove it for |J_s| = k + 1. Let s ∈ J_s; then |J_s \ {s}| = k, and hence the inductive assumption applies to the system

  g_s(x) < 0,  g_j(x) ≤ 0, j ∈ J_s \ {s},  g_j(x) ≤ 0, j ∈ J_r,  x ∈ C.   (11)

By construction the system (11) has no solution and satisfies the Slater condition, and by the inductive assumption we have multipliers ŷ_j such that

  g_s(x) + Σ_{j ∈ J_r ∪ J_s\{s}} ŷ_j g_j(x) ≥ 0  for all x ∈ C,   (12)

where ŷ_j > 0 for all j ∈ J_s \ {s} and ŷ_j ≥ 0 for all j ∈ J_r. Adding a sufficiently large positive multiple of (12) to (8), one obtains the desired positive multipliers.

Remark: Note that we have in fact proved slightly more than was stated: the multipliers of all the singular constraints can be made strictly positive. (Compare: strict complementarity in linear optimization — the Goldman–Tucker theorem.)
Karush–Kuhn–Tucker theory: Lagrangian and saddle point

The Lagrange function (Lagrangian) of (CO) is

  L(x, y) := f(x) + Σ_{j=1}^m y_j g_j(x),   (13)

where x ∈ C and y ≥ 0. Note that for fixed y the Lagrangian is convex in x; for fixed x, it is linear in y.

Definition 3. A vector pair (x̄, ȳ) ∈ IR^{n+m}, x̄ ∈ C and ȳ ≥ 0, is called a saddle point of the Lagrange function L if

  L(x̄, y) ≤ L(x̄, ȳ) ≤ L(x, ȳ)   (14)

for all x ∈ C and y ≥ 0.
A saddle point lemma

Lemma 3. The vector pair (x̄, ȳ) ∈ IR^{n+m}, x̄ ∈ C and ȳ ≥ 0, is a saddle point of L(x, y) if and only if

  inf_{x∈C} sup_{y≥0} L(x, y) = L(x̄, ȳ) = sup_{y≥0} inf_{x∈C} L(x, y).   (15)

Proof: The saddle point inequality (14) easily follows from (15). On the other hand, for any (x̂, ŷ) one has

  inf_{x∈C} L(x, ŷ) ≤ L(x̂, ŷ) ≤ sup_{y≥0} L(x̂, y),

hence one can take the supremum of the left-hand side over ŷ ≥ 0 and the infimum of the right-hand side over x̂ ∈ C, resulting in

  sup_{y≥0} inf_{x∈C} L(x, y) ≤ inf_{x∈C} sup_{y≥0} L(x, y).   (16)

Using the saddle point inequality (14) one obtains

  inf_{x∈C} sup_{y≥0} L(x, y) ≤ sup_{y≥0} L(x̄, y) ≤ L(x̄, ȳ) ≤ inf_{x∈C} L(x, ȳ) ≤ sup_{y≥0} inf_{x∈C} L(x, y).   (17)

Combining (17) and (16), the equality (15) follows.
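As a quick numeric sanity check of the saddle point inequality (14) — the instance below is ours, not from the slides — take the toy convex problem min x² subject to 1 − x ≤ 0 with C = IR. Its Lagrangian is L(x, y) = x² + y(1 − x), and (x̄, ȳ) = (1, 2) is a saddle point with common value 1:

```python
import numpy as np

# Hypothetical toy instance (not from the slides):
#   min x^2  s.t.  1 - x <= 0,  C = IR,
# with Lagrangian L(x, y) = x^2 + y(1 - x) and saddle point (x̄, ȳ) = (1, 2).

def L(x, y):
    return x**2 + y * (1.0 - x)

x_bar, y_bar = 1.0, 2.0

xs = np.linspace(-5.0, 5.0, 10001)   # grid over x in C = IR
ys = np.linspace(0.0, 5.0, 10001)    # grid over y >= 0

inner_inf = min(L(x, y_bar) for x in xs)   # approximates inf_x L(x, ȳ)
inner_sup = max(L(x_bar, y) for y in ys)   # approximates sup_y L(x̄, y)

# Saddle point inequality (14): L(x̄, y) <= L(x̄, ȳ) <= L(x, ȳ)
assert inner_sup <= L(x_bar, y_bar) + 1e-9 <= inner_inf + 1e-9
```

Here L(x̄, y) = 1 for every y (the constraint is active), while L(x, ȳ) = (x − 1)² + 1 ≥ 1, so both sides of (14) hold and both optima in (15) equal 1.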
Karush-Kuhn-Tucker Theorem

Theorem 2. Assume that problem (CO) satisfies the Slater regularity condition. The vector x̄ is an optimal solution of (CO) if and only if there is a vector ȳ such that (x̄, ȳ) is a saddle point of the Lagrange function L.

Proof: The easy direction: if (x̄, ȳ) is a saddle point of L(x, y), then x̄ is optimal for (CO). The proof of this part does not need any regularity condition. From the saddle point inequality (14) one has

  f(x̄) + Σ_{j=1}^m y_j g_j(x̄) ≤ f(x̄) + Σ_{j=1}^m ȳ_j g_j(x̄) ≤ f(x) + Σ_{j=1}^m ȳ_j g_j(x)

for all y ≥ 0 and for all x ∈ C. The first inequality implies g_j(x̄) ≤ 0 for all j = 1, …, m (otherwise the left-hand side would be unbounded as y_j → ∞), hence x̄ ∈ F is feasible for (CO). Taking the two extreme sides of the above inequality and substituting y = 0, we have

  f(x̄) ≤ f(x) + Σ_{j=1}^m ȳ_j g_j(x) ≤ f(x)  for all x ∈ F,

i.e. x̄ is optimal.
KKT proof cntd.

To prove the other direction we need Slater regularity and the Convex Farkas Lemma 2. Let us take an optimal solution x̄ of the convex optimization problem (CO). Then the inequality system

  f(x) − f(x̄) < 0,  g_j(x) ≤ 0, j = 1, …, m,  x ∈ C

is infeasible. Applying the Convex Farkas Lemma 2, one has ȳ ≥ 0 such that

  f(x) − f(x̄) + Σ_{j=1}^m ȳ_j g_j(x) ≥ 0  for all x ∈ C.

Using that x̄ is feasible, one easily derives the saddle point inequality

  f(x̄) + Σ_{j=1}^m y_j g_j(x̄) ≤ f(x̄) ≤ f(x) + Σ_{j=1}^m ȳ_j g_j(x),

which completes the proof.
KKT Corollaries

Corollary 1. Under the assumptions of Theorem 2, the vector x̄ ∈ C is an optimal solution of (CO) if and only if there exists a ȳ ≥ 0 such that

  (i)  f(x̄) = min_{x∈C} {f(x) + Σ_{j=1}^m ȳ_j g_j(x)}  and
  (ii) Σ_{j=1}^m ȳ_j g_j(x̄) = max_{y≥0} Σ_{j=1}^m y_j g_j(x̄).

Corollary 2. Under the assumptions of Theorem 2, the vector x̄ ∈ F is an optimal solution of (CO) if and only if there exists a ȳ ≥ 0 such that

  (i)  f(x̄) = min_{x∈C} {f(x) + Σ_{j=1}^m ȳ_j g_j(x)}  and
  (ii) Σ_{j=1}^m ȳ_j g_j(x̄) = 0.

Corollary 3. Assume that C = IR^n and the functions f, g_1, …, g_m are continuously differentiable. Under the assumptions of Theorem 2, the vector x̄ ∈ F is an optimal solution of (CO) if and only if there exists a ȳ ≥ 0 such that

  (i)  0 = ∇f(x̄) + Σ_{j=1}^m ȳ_j ∇g_j(x̄)  and
  (ii) Σ_{j=1}^m ȳ_j g_j(x̄) = 0.
KKT point

Definition 4. Assume that C = IR^n and the functions f, g_1, …, g_m are continuously differentiable. The vector pair (x̄, ȳ) ∈ IR^{n+m} is called a Karush–Kuhn–Tucker (KKT) point of (CO) if

  (i)   g_j(x̄) ≤ 0 for all j ∈ J,
  (ii)  0 = ∇f(x̄) + Σ_{j=1}^m ȳ_j ∇g_j(x̄),
  (iii) Σ_{j=1}^m ȳ_j g_j(x̄) = 0,
  (iv)  ȳ ≥ 0.

Corollary 4. Assume that C = IR^n, the functions f, g_1, …, g_m are continuously differentiable convex functions, and the assumptions of Theorem 2 hold. If (x̄, ȳ) is a KKT point, then x̄ is an optimal solution of (CO).
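Conditions (i)–(iv) are easy to check mechanically. A minimal sketch on a hypothetical instance (ours, not from the slides): for min x₁² + x₂² subject to g(x) = 1 − x₁ − x₂ ≤ 0, the pair x̄ = (1/2, 1/2), ȳ = 1 is a KKT point:

```python
import numpy as np

# Hypothetical toy instance (not from the slides):
#   min x1^2 + x2^2  s.t.  g(x) = 1 - x1 - x2 <= 0,  C = IR^2.
# Candidate KKT point: x̄ = (1/2, 1/2), ȳ = 1.

x_bar = np.array([0.5, 0.5])
y_bar = 1.0

g = 1.0 - x_bar.sum()                    # constraint value g(x̄)
grad_f = 2.0 * x_bar                     # ∇f(x̄) = (1, 1)
grad_g = np.array([-1.0, -1.0])          # ∇g(x̄)
stationarity = grad_f + y_bar * grad_g   # (ii): should be the zero vector

assert g <= 1e-12                        # (i)   feasibility
assert np.allclose(stationarity, 0.0)    # (ii)  stationarity
assert abs(y_bar * g) <= 1e-12           # (iii) complementarity
assert y_bar >= 0.0                      # (iv)  dual feasibility
```

Since f and g are convex here, Corollary 4 guarantees that x̄ is optimal.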
Duality in CO: the Lagrange dual

Definition 5. Denote

  ψ(y) = inf_{x∈C} {f(x) + Σ_{j=1}^m y_j g_j(x)}.

The problem

  (LD)  sup_{y≥0} ψ(y)

is called the Lagrange dual of problem (CO).

Lemma 4. The Lagrange dual (LD) of (CO) is a convex optimization problem, even if the functions f, g_1, …, g_m are not convex.

Proof: ψ(y) is concave! Let ȳ, ŷ ≥ 0 and 0 ≤ λ ≤ 1. Then

  ψ(λȳ + (1−λ)ŷ) = inf_{x∈C} {f(x) + Σ_{j=1}^m (λȳ_j + (1−λ)ŷ_j) g_j(x)}
    = inf_{x∈C} {λ[f(x) + Σ_{j=1}^m ȳ_j g_j(x)] + (1−λ)[f(x) + Σ_{j=1}^m ŷ_j g_j(x)]}
    ≥ inf_{x∈C} {λ[f(x) + Σ_{j=1}^m ȳ_j g_j(x)]} + inf_{x∈C} {(1−λ)[f(x) + Σ_{j=1}^m ŷ_j g_j(x)]}
    = λψ(ȳ) + (1−λ)ψ(ŷ).
Results on the Lagrange dual

Theorem 3 (Weak duality). If x̄ is a feasible solution of (CO) and ȳ ≥ 0, then

  ψ(ȳ) ≤ f(x̄),

and equality holds if and only if inf_{x∈C} {f(x) + Σ_{j=1}^m ȳ_j g_j(x)} = f(x̄).

Proof: ψ(ȳ) = inf_{x∈C} {f(x) + Σ_{j=1}^m ȳ_j g_j(x)} ≤ f(x̄) + Σ_{j=1}^m ȳ_j g_j(x̄) ≤ f(x̄). Equality holds iff inf_{x∈C} {f(x) + Σ_{j=1}^m ȳ_j g_j(x)} = f(x̄), and hence ȳ_j g_j(x̄) = 0 for all j ∈ J.

Corollary 5. If x̄ is a feasible solution of (CO), ȳ ≥ 0 and ψ(ȳ) = f(x̄), then the vector x̄ is an optimal solution of (CO) and ȳ is optimal for (LD). Further, if the functions f, g_1, …, g_m are continuously differentiable, then (x̄, ȳ) is a KKT point.

Theorem 4 (Strong duality). Assume that (CO) satisfies the Slater regularity condition, and let x̄ be a feasible solution of (CO). The vector x̄ is an optimal solution of (CO) if and only if there exists a ȳ ≥ 0 such that ȳ is an optimal solution of (LD) and ψ(ȳ) = f(x̄).
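Weak and strong duality can be observed numerically on a hypothetical one-dimensional instance (ours, not from the slides): for min (x − 2)² subject to x − 1 ≤ 0 with C = IR, the Lagrangian minimizer is x = 2 − y/2, so ψ(y) = y − y²/4, while the primal optimum is f(1) = 1; the problem is Slater regular, so sup ψ = 1 is attained at ȳ = 2:

```python
import numpy as np

# Hypothetical toy instance (not from the slides):
#   min (x-2)^2  s.t.  x - 1 <= 0,  C = IR.
# psi(y) = inf_x {(x-2)^2 + y(x-1)}, attained at x = 2 - y/2, equals y - y^2/4.

def psi(y):
    x = 2.0 - y / 2.0                  # unconstrained minimizer of the Lagrangian
    return (x - 2.0)**2 + y * (x - 1.0)

f_opt = 1.0                            # primal optimal value, attained at x̄ = 1
ys = np.linspace(0.0, 10.0, 100001)
vals = np.array([psi(y) for y in ys])

assert np.all(vals <= f_opt + 1e-12)   # weak duality: psi(y) <= f(x̄) for all y >= 0
assert abs(vals.max() - f_opt) < 1e-8  # strong duality under Slater regularity
```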
Wolfe dual

Definition 6. Assume that C = IR^n and the functions f, g_1, …, g_m are continuously differentiable and convex. The problem

  (WD)  sup  f(x) + Σ_{j=1}^m y_j g_j(x)
        s.t. ∇f(x) + Σ_{j=1}^m y_j ∇g_j(x) = 0,  y ≥ 0

is called the Wolfe dual of the convex optimization problem (CO).

Warning! Remember, we are only allowed to form the Wolfe dual of a nonlinear optimization problem if it is convex! For nonconvex problems one has to work with the Lagrange dual.
Examples for dual problems

Linear optimization:

  (LO)  min {c^T x : Ax = b, x ≥ 0}.

Write the constraints as inequalities:
  g_j(x) = (a^j)^T x − b_j,           j = 1, …, m;
  g_j(x) = −(a^{j−m})^T x + b_{j−m},  j = m+1, …, 2m;
  g_j(x) = −x_{j−2m},                 j = 2m+1, …, 2m+n.

Denote the Lagrange multipliers by y⁻, y⁺ and s; then the Wolfe dual (WD) of (LO) is

  max  c^T x + (y⁻)^T (Ax − b) + (y⁺)^T (−Ax + b) + s^T (−x)
  s.t. c + A^T y⁻ − A^T y⁺ − s = 0,  y⁻ ≥ 0, y⁺ ≥ 0, s ≥ 0.

Substituting c = A^T y⁺ − A^T y⁻ + s and letting y = y⁺ − y⁻, we obtain

  max  b^T y
  s.t. A^T y + s = c,  s ≥ 0.

Note that the KKT conditions provide the well-known complementary slackness condition x^T s = 0.
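The zero duality gap and complementary slackness can be checked numerically by solving the primal and the dual as two separate LPs. A small sketch, assuming SciPy's `linprog` (with the HiGHS backend) is available; the instance itself is hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical LO instance (not from the slides):
#   primal: min c^T x  s.t.  Ax = b, x >= 0
#   dual:   max b^T y  s.t.  A^T y + s = c, s >= 0   (i.e. A^T y <= c)
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

# Primal LP.
p = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method="highs")

# Dual LP: max b^T y  <=>  min -b^T y  s.t.  A^T y <= c, y free.
d = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)], method="highs")

assert p.status == 0 and d.status == 0
assert abs(p.fun - (-d.fun)) < 1e-9   # zero duality gap: c^T x = b^T y
s = c - A.T @ d.x                     # dual slacks s = c - A^T y
assert abs(p.x @ s) < 1e-9            # complementary slackness: x^T s = 0
```

Here the primal optimum is x = (0, 0, 1) with value 0, and the dual optimum is y = 0, so both objective values coincide, as Theorem 4 predicts for this Slater-regular (linear) problem.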
Quadratic optimization:

  (QO)  min {c^T x + ½ x^T Q x : Ax ≥ b, x ≥ 0}.

Here
  g_j(x) = −(a^j)^T x + b_j,  j = 1, …, m;
  g_j(x) = −x_{j−m},          j = m+1, …, m+n.

With Lagrange multipliers y and s, the Wolfe dual (WD) of (QO) is

  max  c^T x + ½ x^T Q x + y^T (−Ax + b) + s^T (−x)
  s.t. c + Qx − A^T y − s = 0,  y ≥ 0, s ≥ 0.

Substituting c = −Qx + A^T y + s in the objective gives

  max  b^T y − ½ x^T Q x
  s.t. −Qx + A^T y + s = c,  y ≥ 0, s ≥ 0.

Writing Q = D^T D (e.g. by a Cholesky factorization) and letting z = Dx, the following dual problem (QD) is obtained:

  max  b^T y − ½ z^T z
  s.t. −D^T z + A^T y + s = c,  y ≥ 0, s ≥ 0.
Constrained maximum likelihood estimation

Given a finite set of sample points x_i (1 ≤ i ≤ n), find the most probable density values that satisfy some linear (e.g. convexity) constraints: maximize the likelihood function

  Π_{i=1}^n x_i  under the conditions  Ax ≥ 0,  d^T x = 1,  x ≥ 0.

The objective can equivalently be replaced by

  min − Σ_{i=1}^n ln x_i.

With Lagrange multipliers y ∈ IR^m, t ∈ IR and s ∈ IR^n, the Wolfe dual (WD) is

  max  − Σ_{i=1}^n ln x_i + y^T (−Ax) + t (d^T x − 1) + s^T (−x)
  s.t. −X^{-1} e − A^T y + t d − s = 0,  y ≥ 0, s ≥ 0,

where X = diag(x) and e is the all-one vector. Multiplying the first constraint by x^T, one has

  −x^T X^{-1} e − x^T A^T y + t x^T d − x^T s = 0.

Using d^T x = 1, x^T X^{-1} e = n and the optimality conditions y^T Ax = 0, x^T s = 0, we have t = n. Since x is necessarily strictly positive, the dual variable s must be zero at the optimum.
Constrained maximum likelihood estimation: 2

Substituting t = n and s = 0, the dual reduces to

  max  − Σ_{i=1}^n ln x_i
  s.t. X^{-1} e + A^T y = n d,  y ≥ 0.

Eliminating the variables x_i > 0 via x_i = 1/(n d_i − a_i^T y), so that −ln x_i = ln(n d_i − a_i^T y) for all i, we obtain

  max  Σ_{i=1}^n ln(n d_i − a_i^T y)
  s.t. A^T y ≤ n d,  y ≥ 0.
Example: positive duality gap

Duffin's convex optimization problem:

  (CO)  min e^{−x_2}
        s.t. √(x_1² + x_2²) − x_1 ≤ 0,  x ∈ IR².

The feasible region is F = {x ∈ IR² : x_1 ≥ 0, x_2 = 0}, so (CO) is not Slater regular. The optimal value of the objective function is 1.

The Lagrange function is given by

  L(x, y) = e^{−x_2} + y (√(x_1² + x_2²) − x_1).

Now, let ε = √(x_1² + x_2²) − x_1; then x_2² − 2εx_1 − ε² = 0. Hence, for any ε > 0 we can find x_1 > 0 such that ε = √(x_1² + x_2²) − x_1 even as x_2 goes to infinity. However, as x_2 goes to infinity, e^{−x_2} goes to 0. So

  ψ(y) = inf_{x∈IR²} ( e^{−x_2} + y (√(x_1² + x_2²) − x_1) ) = 0,
thus the optimal value of the Lagrange dual

  (LD)  max ψ(y)  s.t. y ≥ 0

is 0: a nonzero duality gap, equal to 1!
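The argument above can be watched numerically. Along the curve x_1 = (x_2² − ε²)/(2ε) the constraint residual √(x_1² + x_2²) − x_1 equals exactly ε, so for any fixed y ≥ 0, letting x_2 grow while ε shrinks drives L(x, y) toward 0, even though every feasible point has objective value 1 (a minimal sketch; the particular y and grid values are ours):

```python
import numpy as np

# Duffin's problem: L(x, y) = exp(-x2) + y * (sqrt(x1^2 + x2^2) - x1).
# On x1 = (x2^2 - eps^2) / (2*eps) the constraint residual is exactly eps.

def L(x1, x2, y):
    return np.exp(-x2) + y * (np.hypot(x1, x2) - x1)

y = 5.0                                  # an arbitrary fixed multiplier
vals = []
for x2, eps in [(10.0, 1e-1), (100.0, 1e-2), (1000.0, 1e-3)]:
    x1 = (x2**2 - eps**2) / (2.0 * eps)  # makes sqrt(x1^2 + x2^2) - x1 = eps
    vals.append(L(x1, x2, y))

# L decreases toward 0, so psi(y) = inf L = 0 for every y >= 0,
# while the primal optimal value is 1: a duality gap of 1.
assert vals[0] > vals[1] > vals[2] >= 0.0
assert vals[2] < 1e-2
```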
Example: infinite duality gap

Duffin's example slightly modified:

  min  x_2
  s.t. √(x_1² + x_2²) − x_1 ≤ 0.

The feasible region is F = {x ∈ IR² : x_1 ≥ 0, x_2 = 0}; the problem is not Slater regular. The optimal value of the objective function is 0. The Lagrange function is given by

  L(x, y) = x_2 + y (√(x_1² + x_2²) − x_1).

So

  ψ(y) = inf_{x∈IR²} { x_2 + y (√(x_1² + x_2²) − x_1) } = −∞,

thus the optimal value of the Lagrange dual (LD) max {ψ(y) : y ≥ 0} is −∞, because ψ(y) is minus infinity for every y!
Semidefinite optimization

Let A_0, A_1, …, A_n ∈ IR^{m×m} be symmetric matrices, and let c ∈ IR^n, x ∈ IR^n. The primal SDO problem is defined as

  (PSO)  min  c^T x   (18)
         s.t. F(x) := A_0 + Σ_{k=1}^n A_k x_k ⪰ 0,

where ⪰ 0 denotes positive semidefiniteness. The dual SDO problem is:

  (DSO)  max  −Tr(A_0 Z)   (19)
         s.t. Tr(A_k Z) = c_k for all k = 1, …, n,  Z ⪰ 0.

Theorem 5 (Weak duality). If x ∈ IR^n is primal feasible and Z ∈ IR^{m×m} is dual feasible, then

  c^T x ≥ −Tr(A_0 Z),

with equality if and only if F(x)Z = 0.

Proof: c^T x + Tr(A_0 Z) = Σ_{k=1}^n Tr(A_k Z) x_k + Tr(A_0 Z) = Tr((A_0 + Σ_{k=1}^n A_k x_k) Z) = Tr(F(x) Z) ≥ 0.
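The trace identity in the proof, c^T x + Tr(A_0 Z) = Tr(F(x) Z) whenever Tr(A_k Z) = c_k for all k, holds by linearity alone; a quick numeric sketch on random symmetric data (our construction, not from the slides — feasibility of x and positive semidefiniteness of Z are not needed for the identity itself):

```python
import numpy as np

# Check: if c_k = Tr(A_k Z) for k = 1..n, then c^T x + Tr(A_0 Z) = Tr(F(x) Z).
rng = np.random.default_rng(0)
m, n = 4, 3

def sym(M):
    return (M + M.T) / 2.0            # symmetrize a random matrix

A = [sym(rng.standard_normal((m, m))) for _ in range(n + 1)]  # A_0, ..., A_n
Z = sym(rng.standard_normal((m, m)))
x = rng.standard_normal(n)

c = np.array([np.trace(A[k] @ Z) for k in range(1, n + 1)])   # c_k = Tr(A_k Z)
F = A[0] + sum(x[k - 1] * A[k] for k in range(1, n + 1))      # F(x)

lhs = c @ x + np.trace(A[0] @ Z)
rhs = np.trace(F @ Z)
assert abs(lhs - rhs) < 1e-10
```

Weak duality then follows because Tr(F(x)Z) ≥ 0 whenever both F(x) ⪰ 0 and Z ⪰ 0.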
SDO Dual as Lagrange Dual

Rewrite (PSO) with a slack matrix S:

  (PSO')  min  c^T x   (20)
          s.t. −F(x) + S = 0,  S ⪰ 0.

The Lagrange function L(x, S, Z) is defined on {(x, S, Z) : x ∈ IR^n, S ∈ IR^{m×m}, S ⪰ 0, Z ∈ IR^{m×m}} by

  L(x, S, Z) = c^T x − e^T (F(x) ∘ Z) e + e^T (S ∘ Z) e,

where ∘ denotes the entrywise (Minkowski) product, so that e^T (S ∘ Z) e = Tr(SZ). Hence

  L(x, S, Z) = c^T x − Σ_{k=1}^n x_k Tr(A_k Z) − Tr(A_0 Z) + Tr(SZ).   (21)

The Lagrange dual is

  (DSDL)  max {ψ(Z) : Z ∈ IR^{m×m}},   (22)

where

  ψ(Z) = min {L(x, S, Z) : x ∈ IR^n, S ∈ IR^{m×m}, S ⪰ 0}.   (23)

To evaluate ψ(Z), first minimize in S:

  min_{S⪰0} Tr(SZ) = 0 if Z ⪰ 0, and −∞ otherwise.   (24)
SDO dual cntd.

Minimizing (23) in x, we equate the x-gradient of L(x, S, Z) to zero:

  c_k − Tr(A_k Z) = 0 for all k = 1, …, n.   (25)

Multiplying (25) by x_k and summing up gives c^T x − Σ_{k=1}^n x_k Tr(A_k Z) = 0. Combining this with the results presented in (24) and (25), the simplified form of the Lagrange dual (22), the Lagrange–Wolfe dual

  (DSO)  max  −Tr(A_0 Z)
         s.t. Tr(A_k Z) = c_k for all k = 1, …, n,  Z ⪰ 0

follows. It is identical to (19).
Duality in cone-linear optimization

Consider

  min  c^T x
  s.t. Ax − b ∈ C_1,  x ∈ C_2.   (26)

Definition 7. Let C ⊆ IR^n be a convex cone. The dual cone (polar, positive polar) C* is defined by

  C* := {z ∈ IR^n : x^T z ≥ 0 for all x ∈ C}.

The dual of a cone-linear problem. Introduce a slack vector s = Ax − b:

  min  c^T x
  s.t. s − Ax + b = 0,  s ∈ C_1,  x ∈ C_2.

We have linear equality constraints s − Ax + b = 0, and (s, x) must be in the convex cone

  C_1 ⊕ C_2 := {(s, x) : s ∈ C_1, x ∈ C_2}.

The Lagrange function L(s, x, y) is defined on {(s, x, y) : s ∈ C_1, x ∈ C_2, y ∈ IR^m} and is given by

  L(s, x, y) = c^T x + y^T (s − Ax + b) = b^T y + s^T y + x^T (c − A^T y).
The Lagrange dual of the cone-linear problem is

  max_{y∈IR^m} ψ(y),

where

  ψ(y) = min {L(s, x, y) : s ∈ C_1, x ∈ C_2}.

The vector s is in the cone C_1, hence

  min_{s∈C_1} s^T y = 0 if y ∈ C_1*, and −∞ otherwise.

Similarly,

  min_{x∈C_2} x^T (c − A^T y) = 0 if c − A^T y ∈ C_2*, and −∞ otherwise.

By combining these, we have

  ψ(y) = b^T y if y ∈ C_1* and c − A^T y ∈ C_2*, and −∞ otherwise.

Thus the dual of (26) is:

  max  b^T y
  s.t. c − A^T y ∈ C_2*,  y ∈ C_1*.