COM S 672: Advanced Topics in Computational Models of Learning
Optimization for Learning
Lecture Note 4: Optimality Conditions
Jia (Kevin) Liu
Assistant Professor, Department of Computer Science
Iowa State University, Ames, Iowa, USA
Fall 2017
Recap: Last Lecture
Given a minimization problem
$$ \min_x \; f(x) \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \dots, m \quad (\text{dual } u_i \ge 0), \qquad h_j(x) = 0, \; j = 1, \dots, p \quad (\text{dual } v_j \text{ unconstrained}), $$
we define the Lagrangian
$$ L(x, u, v) = f(x) + \sum_{i=1}^m u_i g_i(x) + \sum_{j=1}^p v_j h_j(x) $$
and the Lagrangian dual function
$$ \theta(u, v) = \min_x L(x, u, v). $$
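As a quick worked illustration of these definitions (my own example, not from the original slides), consider the one-dimensional problem
$$ \min_x \; x^2 \quad \text{subject to} \quad 1 - x \le 0. $$
The Lagrangian is $L(x, u) = x^2 + u(1 - x)$, minimized over $x$ at $x = u/2$, so the dual function is $\theta(u) = u - u^2/4$. Maximizing over $u \ge 0$ gives $u^\star = 2$ and $d^\star = 1$, which equals the primal optimal value $p^\star = 1$ (attained at $x^\star = 1$), as strong duality predicts for this convex problem.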
Recap: Last Lecture
The resulting Lagrangian dual problem is
$$ \max_{u, v} \; \theta(u, v) \quad \text{subject to} \quad u \ge 0. $$
Important properties:
- The dual problem is always convex (i.e., $\theta$ is always concave), even if the primal problem is non-convex.
- Weak duality always holds: the primal and dual optimal values $p^\star$ and $d^\star$ satisfy $p^\star \ge d^\star$.
- Slater's condition: for a convex primal, if there exists an $x$ such that $g_1(x) < 0, \dots, g_m(x) < 0$ and $h_1(x) = 0, \dots, h_p(x) = 0$, then strong duality holds: $p^\star = d^\star$.
Outline
Today:
- KKT conditions
- Geometric interpretation
- Relevant examples in machine learning and other areas
Karush-Kuhn-Tucker Conditions
Given the general problem
$$ \min_x \; f(x) \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \dots, m \quad (\text{dual } u_i \ge 0), \qquad h_j(x) = 0, \; j = 1, \dots, p \quad (\text{dual } v_j \text{ unconstrained}), $$
the Karush-Kuhn-Tucker (KKT) conditions are:
- Stationarity (ST): $\nabla_x f(x) + \sum_{i=1}^m u_i \nabla_x g_i(x) + \sum_{j=1}^p v_j \nabla_x h_j(x) = 0$
- Complementary slackness (CS): $u_i g_i(x) = 0$ for all $i$ (i.e., for each $i$, either $u_i = 0$ or $g_i(x) = 0$)
- Primal feasibility (PF): $g_i(x) \le 0$, $h_j(x) = 0$ for all $i, j$
- Dual feasibility (DF): $u_i \ge 0$ for all $i$
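To make the four conditions concrete, here is a minimal numerical sketch (my own illustration, not part of the original slides) that checks them at a candidate point for a small problem: minimize $(x_1 - 1)^2 + (x_2 - 2)^2$ subject to $x_1 + x_2 \le 1$. The function names and tolerance are assumptions made for the example.

```python
import numpy as np

# Problem: minimize (x1-1)^2 + (x2-2)^2  subject to  x1 + x2 - 1 <= 0
grad_f = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] - 2)])
g      = lambda x: x[0] + x[1] - 1          # single inequality constraint
grad_g = lambda x: np.array([1.0, 1.0])

def check_kkt(x, u, tol=1e-8):
    """Return True if (x, u) satisfies ST, CS, PF, DF up to a tolerance."""
    st = np.linalg.norm(grad_f(x) + u * grad_g(x)) <= tol   # stationarity
    cs = abs(u * g(x)) <= tol                               # complementary slackness
    pf = g(x) <= tol                                        # primal feasibility
    df = u >= -tol                                          # dual feasibility
    return st and cs and pf and df

# Candidate obtained by solving the KKT system by hand: x* = (0, 1), u* = 2
print(check_kkt(np.array([0.0, 1.0]), 2.0))   # True
print(check_kkt(np.array([1.0, 2.0]), 0.0))   # False: the unconstrained minimizer is infeasible
```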
KKT Necessity
Theorem 1. If $x^\star$ and $(u^\star, v^\star)$ are primal and dual solutions with zero duality gap (e.g., as implied by Slater's condition), then $(x^\star, u^\star, v^\star)$ satisfies the KKT conditions.
Proof. PF and DF hold by assumption. Since $x^\star$ and $(u^\star, v^\star)$ are primal and dual solutions with strong duality,
$$ f(x^\star) = \theta(u^\star, v^\star) = \min_x \Big[ f(x) + \sum_{i=1}^m u_i^\star g_i(x) + \sum_{j=1}^p v_j^\star h_j(x) \Big] \le f(x^\star) + \sum_{i=1}^m u_i^\star g_i(x^\star) + \sum_{j=1}^p v_j^\star h_j(x^\star) \le f(x^\star). $$
That is, all of these inequalities are in fact equalities. Then:
- $x^\star$ minimizes $L(x, u^\star, v^\star)$ over $x \in \mathbb{R}^n$ (unconstrained), so the gradient of $L(x, u^\star, v^\star)$ must be $0$ at $x^\star$, i.e., the stationarity condition holds.
- Since $u_i^\star g_i(x^\star) \le 0$ for every $i$ (by PF and DF) and $h_j(x^\star) = 0$, the last equality forces $\sum_{i=1}^m u_i^\star g_i(x^\star) = 0$, so each term is zero, i.e., the complementary slackness condition holds. □
KKT Sufficiency
Theorem 2. If $x^\star$ and $(u^\star, v^\star)$ satisfy the KKT conditions, then $x^\star$ and $(u^\star, v^\star)$ are primal and dual optimal solutions, respectively.
Proof. If $x^\star$ and $(u^\star, v^\star)$ satisfy the KKT conditions, then
$$ \theta(u^\star, v^\star) \overset{(a)}{=} f(x^\star) + \sum_{i=1}^m u_i^\star g_i(x^\star) + \sum_{j=1}^p v_j^\star h_j(x^\star) \overset{(b)}{=} f(x^\star), $$
where (a) follows from ST (by stationarity and convexity, $x^\star$ is a minimizer of $L(x, u^\star, v^\star)$ over $x$) and (b) follows from CS and PF. Therefore the duality gap is zero. Since $x^\star$ and $(u^\star, v^\star)$ are also PF and DF, they are primal and dual optimal, respectively. □
In Summary
Putting things together:
Theorem 3. For a convex optimization problem with strong duality (e.g., as implied by Slater's condition or other constraint qualifications):
$$ x^\star \text{ and } (u^\star, v^\star) \text{ are primal and dual solutions} \iff x^\star \text{ and } (u^\star, v^\star) \text{ satisfy the KKT conditions.} $$
Warning: This statement is only true for convex optimization problems. For non-convex optimization problems, the KKT conditions are in general neither necessary nor sufficient! (More on this shortly.)
Where Does this Name Come From?
- Older books and papers referred to these as the KT (Kuhn-Tucker) conditions.
- They first appeared in a publication by Kuhn and Tucker in 1951.
- Kuhn and Tucker shared the John von Neumann Theory Prize in 1980.
- Later it was realized that Karush had derived the same conditions in his unpublished master's thesis in 1939.
(Pictured: William Karush, Harold W. Kuhn, Albert W. Tucker.)
A fun read: R. W. Cottle, "William Karush and the KKT Theorem," Documenta Mathematica, 2012, pp. 255-269.
Other Optimality Conditions
The KKT conditions are a special case of the more general Fritz John conditions:
$$ u_0 \nabla f(x^\star) + \sum_{i=1}^m u_i \nabla g_i(x^\star) + \sum_{j=1}^p v_j \nabla h_j(x^\star) = 0, $$
where the multiplier $u_0 \ge 0$ could be $0$.
In turn, the Fritz John conditions (and hence KKT) belong to a wider class of first-order necessary conditions (FONC), which allow for non-smooth functions using subderivatives.
Further, there is a whole class of second-order necessary and sufficient conditions (SONC, SOSC), also in KKT style.
For an excellent treatment of optimality conditions, see [BSS, Ch. 4-6].
Geometric Interpretation of KKT
At an optimum $x^\star$, stationarity together with $u_i \ge 0$ says
$$ -\nabla f(x^\star) = \sum_{i \in I(x^\star)} u_i \nabla g_i(x^\star), \qquad I(x^\star) \triangleq \{\, i : g_i(x^\star) = 0 \,\}, $$
i.e., $-\nabla f(x^\star)$ (the descent direction of the objective) is contained in the cone generated by the normal vectors $\nabla g_i(x^\star)$ of the binding (active/tight) constraints at $x^\star$.
Physics interpretation: the "pulling force" $-\nabla f(x^\star)$ is exactly balanced by the "repelling forces" of the active constraints.
[Figure: contour lines of the objective and the binding constraints at $x^\star$, with $-\nabla f(x^\star)$ lying inside the cone spanned by $\nabla g_i(x^\star)$, $i \in I(x^\star)$.]
When is KKT Neither Sufficient Nor Necessary?
(Not necessary): $x^\star$ is a (local) minimum $\;\not\Rightarrow\;$ $x^\star$ is a KKT point.
When a constraint (e.g., $g_2$) is non-convex, the geometry of the feasible region at $x^\star$ can be such that no multipliers $u_1, u_2 \ge 0$ satisfy $-\nabla f(x^\star) = u_1 \nabla g_1(x^\star) + u_2 \nabla g_2(x^\star)$, even though $x^\star$ is a local minimum (a constraint qualification fails).
(Not sufficient): $x^\star$ is a KKT point $\;\not\Rightarrow\;$ $x^\star$ is a (local) minimum.
With a non-convex objective, there can exist $u_1, u_2 \ge 0$ such that $-\nabla f(x^\star) = u_1 \nabla g_1(x^\star) + u_2 \nabla g_2(x^\star)$, and yet there exist feasible points $\tilde{x}$ with $f(\tilde{x}) < f(x^\star)$, so $x^\star$ is not optimal.
[Figure: two hand-drawn sketches illustrating the two cases.]
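A minimal one-dimensional counterexample for the "not sufficient" direction (my own illustration, not the hand-drawn sketch on the slide):
$$ \min_x \; -x^2 \quad \text{subject to} \quad x^2 - 1 \le 0. $$
At $x^\star = 0$ with $u^\star = 0$: stationarity holds ($-2x^\star + u^\star \cdot 2x^\star = 0$), CS holds ($u^\star (x^{\star 2} - 1) = 0$), and PF and DF hold, so $x^\star = 0$ is a KKT point. Yet the feasible points $x = \pm 1$ attain $f = -1 < 0 = f(x^\star)$; in fact $x^\star = 0$ is the global maximizer over the feasible set.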
KKT Example 1: Quadratic Problems with Equality Constraints
Consider, for $Q \succeq 0$, the quadratic programming problem
$$ \min_x \; \tfrac{1}{2} x^\top Q x + c^\top x \quad \text{subject to} \quad Ax = 0. $$
This is a convex problem with no inequality constraints. By KKT, $x$ is primal optimal if and only if, for some dual variable $u$,
$$ \begin{bmatrix} Q & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix}. $$
This linear equation system combines ST (first block row) and PF (second block row); CS and DF are vacuous here.
Such systems often arise when using Newton's method to solve equality-constrained problems $\min_x \{ f(x) : Ax = b \}$: by Taylor's expansion around the current iterate $x$,
$$ f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \tfrac{1}{2} \Delta x^\top \nabla^2 f(x) \Delta x, $$
and we further require $A \Delta x = 0$ (so that $A(x + \Delta x) = b$ is preserved), which is exactly a problem of the above form in $\Delta x$.
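As a sketch of how one might solve this KKT system numerically (my own illustration; the matrix sizes and random data below are made up for the example):

```python
import numpy as np

# Random convex QP data: minimize 0.5 x'Qx + c'x  subject to  Ax = 0
rng = np.random.default_rng(0)
n, p = 5, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)            # Q positive definite
c = rng.standard_normal(n)
A = rng.standard_normal((p, n))

# Assemble and solve the KKT system  [Q A'; A 0] [x; u] = [-c; 0]
K = np.block([[Q, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-c, np.zeros(p)])
sol = np.linalg.solve(K, rhs)
x, u = sol[:n], sol[n:]

print("stationarity residual:", np.linalg.norm(Q @ x + c + A.T @ u))   # ~0
print("primal feasibility   :", np.linalg.norm(A @ x))                  # ~0
```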
Example 2: Support Vector Machine
Given labels $y \in \{-1, 1\}^m$ and feature vectors $x_1, \dots, x_m$, let $X \triangleq [x_1, \dots, x_m]^\top$. Recall from Lecture 1 the support vector machine problem:
$$ \min_{w, b, \xi} \; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^m \xi_i $$
$$ \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \; i = 1, \dots, m \quad (\text{dual } u_i \ge 0), \qquad \xi_i \ge 0, \; i = 1, \dots, m \quad (\text{dual } v_i \ge 0). $$
Introducing dual variables $u, v \ge 0$, the KKT system is:
(ST): $\; 0 = \sum_{i=1}^m u_i y_i, \qquad w = \sum_{i=1}^m u_i y_i x_i, \qquad u = C \mathbf{1} - v$
(CS): $\; v_i \xi_i = 0, \qquad u_i \big[ 1 - \xi_i - y_i (x_i^\top w + b) \big] = 0, \quad i = 1, \dots, m$
Example 2: Support Vector Machine
Hence, at optimality, we have $w^\star = \sum_{i=1}^m u_i^\star y_i x_i$, and $u_i^\star$ is nonzero only if $y_i (x_i^\top w^\star + b^\star) = 1 - \xi_i^\star$. Such points are called the support points.
- For a support point $i$, if $\xi_i^\star = 0$, then $x_i$ lies on the edge of the margin, and $u_i^\star \in (0, C]$.
- For a support point $i$, if $\xi_i^\star \ne 0$, then $x_i$ lies on the wrong side of the margin, and $u_i^\star = C$.
[Figure: separating hyperplane $w^\top x + b = 0$ with margin width $1/\|w\|$ on each side and labeled support points.]
The KKT conditions do not really give us a way to find the solution here, but they give better understanding and are useful in proofs. In fact, we can use them to screen away non-support points before performing the optimization (lower complexity).
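To see the complementary-slackness structure numerically, here is a small self-contained sketch (my own illustration, not from the slides) that solves a simplified SVM dual (intercept $b$ omitted) by projected gradient ascent on a toy 2-D dataset and then recovers $w$ from the KKT relation $w = \sum_i u_i y_i x_i$. The data, step size, iteration count, and thresholds are arbitrary choices for illustration.

```python
import numpy as np

# Toy 2-D data: two roughly separable clusters; intercept b omitted for simplicity
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.5, 0.5, (20, 2)), rng.normal(-1.5, 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
C = 1.0

# Dual (no intercept): maximize  sum(u) - 0.5 u' (yy' * XX') u   s.t.  0 <= u <= C
K = (y[:, None] * y[None, :]) * (X @ X.T)
u = np.zeros(len(y))
step = 1e-3
for _ in range(20000):                      # projected gradient ascent
    u += step * (1.0 - K @ u)               # gradient of the dual objective
    u = np.clip(u, 0.0, C)                  # project onto the box [0, C]^m

w = X.T @ (u * y)                           # KKT: w = sum_i u_i y_i x_i
margins = y * (X @ w)

# Complementary slackness in action: u_i is (near) zero whenever y_i x_i'w > 1
print("non-support points (u_i ~ 0):", np.sum(u < 1e-6))
print("min margin among support pts:", margins[u > 1e-6].min())
```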
Example 3: Water-filling
Example from [BV]: consider the problem
$$ \min_x \; -\sum_{i=1}^n \log(\alpha_i + x_i) \quad \text{subject to} \quad x \ge 0, \; \mathbf{1}^\top x = 1, $$
where $\alpha_i > 0$. In information theory, $\log(\alpha_i + x_i)$ is the communication rate of the $i$th channel, $x_i$ is the power allocated to channel $i$, and $\mathbf{1}^\top x = 1$ is a normalized total power constraint.
Introducing dual variables $u \ge 0$ and $v$, the KKT system is:
(ST): $\; -1/(\alpha_i + x_i) - u_i + v = 0, \; i = 1, \dots, n$
(CS): $\; u_i x_i = 0, \; i = 1, \dots, n$
(PF): $\; x \ge 0, \; \mathbf{1}^\top x = 1$
(DF): $\; u \ge 0$ ($v$ unconstrained)
Eliminating $u$ (note that $u_i = v - 1/(\alpha_i + x_i) \ge 0$) yields:
$$ 1/(\alpha_i + x_i) \le v, \qquad x_i \big( v - 1/(\alpha_i + x_i) \big) = 0, \quad i = 1, \dots, n, \qquad x \ge 0, \; \mathbf{1}^\top x = 1. $$
Example 3: Water-filling
ST and CS imply that
$$ x_i = \begin{cases} 1/v - \alpha_i & \text{if } v < 1/\alpha_i \\ 0 & \text{if } v \ge 1/\alpha_i \end{cases} \quad \Longrightarrow \quad x_i = \max\{0, \; 1/v - \alpha_i\}, \quad i = 1, \dots, n. $$
Also, from PF, i.e., $\mathbf{1}^\top x = 1$, we have
$$ \sum_{i=1}^n \max\{0, \; 1/v - \alpha_i\} = 1. $$
This is a univariate equation, piecewise linear in $1/v$, and not hard to solve. The resulting solution is referred to as the water-filling solution (from [BV], p. 246): think of $1/v$ as a water level; channel $i$ receives power $x_i = 1/v - \alpha_i$ whenever the water level exceeds the "ground level" $\alpha_i$, and zero otherwise.
[Figure: water-filling illustration, with patches of height $\alpha_i$ flooded up to the common water level $1/v$.]
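A small numerical sketch of solving the univariate water-level equation by bisection (my own illustration; the $\alpha$ values are made up):

```python
import numpy as np

alpha = np.array([0.3, 0.8, 1.2, 2.0])       # made-up channel "ground levels"
total_power = 1.0

def power_used(level):
    """Total power allocated when the water level is `level` (= 1/v)."""
    return np.sum(np.maximum(0.0, level - alpha))

# power_used is nondecreasing and piecewise linear in the level, so bisect
lo, hi = 0.0, alpha.max() + total_power
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if power_used(mid) < total_power:
        lo = mid
    else:
        hi = mid

level = 0.5 * (lo + hi)                       # water level 1/v
x = np.maximum(0.0, level - alpha)            # optimal allocation
print("water level 1/v:", level, " allocation x:", x, " sum:", x.sum())
```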
Constrained and Lagrange Forms
Often in ML and statistics, we switch back and forth between the constrained form, where $t \in \mathbb{R}$ is a tuning parameter,
$$ (C): \quad \min_x \; f(x) \quad \text{subject to} \quad g(x) \le t, $$
and the Lagrange form, where $u \ge 0$ is a tuning parameter,
$$ (L): \quad \min_x \; f(x) + u\, g(x), $$
and claim that these are equivalent. Is this true (assuming $f$ and $g$ convex)?
Proof, (C) to (L): If problem (C) is strictly feasible, then strong duality holds (by Slater's condition), and there exists some $u^\star \ge 0$ (a dual solution) such that any solution $x^\star$ of (C) minimizes $f(x) + u^\star (g(x) - t)$ over $x$. Clearly, $x^\star$ is then also a solution of (L) with $u = u^\star$.
(This back-and-forth is exactly the regularized-optimization setting, e.g., Tikhonov/sparsity regularization $\min_x f(x) + \lambda\, \psi(x)$ versus the constrained form $\min_x \{ f(x) : \psi(x) \le t \}$.)
Constrained and Lagrange Forms
(L) to (C): If $x^\star$ is a solution of (L), then the KKT conditions for (C) are satisfied by taking $t = g(x^\star)$, so $x^\star$ is a solution of (C) with that $t$.
Putting things together:
$$ \bigcup_{u \ge 0} \big\{ \text{solutions of (L)} \big\} \;\subseteq\; \bigcup_{t} \big\{ \text{solutions of (C)} \big\} $$
$$ \bigcup_{u \ge 0} \big\{ \text{solutions of (L)} \big\} \;\supseteq\; \bigcup_{t \,:\, \text{(C) strictly feasible}} \big\{ \text{solutions of (C)} \big\} $$
I.e., a nearly perfect equivalence.
Note: if the only value of $t$ that leads to a feasible but not strictly feasible constraint set is $t = 0$, then we do get perfect equivalence. So, e.g., if $g \ge 0$ and (C) and (L) are feasible for all $t, u \ge 0$, then we do get perfect equivalence.
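A small numerical sketch of the (L)-to-(C) mapping for ridge regression (my own illustration; the data and penalty value are made up): solving the penalized form with a given $u$ and setting $t = g(x^\star)$ yields a point satisfying the KKT conditions of the corresponding constrained form.

```python
import numpy as np

# Ridge regression: f(x) = 0.5 ||Ax - b||^2,  g(x) = ||x||^2
rng = np.random.default_rng(2)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
u = 0.7                                            # penalty parameter in form (L)

# Solve (L) in closed form: (A'A + 2u I) x = A'b   [gradient: A'(Ax - b) + 2u x = 0]
x_L = np.linalg.solve(A.T @ A + 2 * u * np.eye(5), A.T @ b)
t = x_L @ x_L                                      # corresponding constraint level

# Check the KKT conditions of (C): min f(x) s.t. g(x) <= t, with multiplier u
stationarity = A.T @ (A @ x_L - b) + u * 2 * x_L   # grad f + u * grad g
print("stationarity residual:", np.linalg.norm(stationarity))    # ~0
print("CS: u*(g(x)-t) =", u * (x_L @ x_L - t))                    # 0 by construction
```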
Next Class
Gradient Descent