COM S 578X: Optimization for Machine Learning


1 COM S 578X: Optimization for Machine Learning
Lecture Note 5: Optimality Conditions
Jia (Kevin) Liu, Assistant Professor
Department of Computer Science, Iowa State University, Ames, Iowa, USA
Fall 2018
JKL (CS@ISU) COM S 578X: Lecture 5 1/18

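2 Recap: Last Lecture
Given a minimization problem
\[
\begin{aligned}
\text{Minimize} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m \qquad (\text{dual variable } u_i \ge 0) \\
& h_j(x) = 0, \quad j = 1, \dots, p \qquad (\text{dual variable } v_j \text{ unconstrained})
\end{aligned}
\]
we define the Lagrangian
\[
L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i g_i(x) + \sum_{j=1}^{p} v_j h_j(x)
\]
and the Lagrangian dual function
\[
\theta(u, v) = \min_x L(x, u, v).
\]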

3 Recap: Last Lecture
The resulting Lagrangian dual problem is
\[
\text{Maximize} \quad \theta(u, v) \qquad \text{subject to} \quad u \ge 0.
\]
Important properties:
The dual problem is always a convex optimization problem (the dual function $\theta$ is always concave), even if the primal problem is nonconvex.
Weak duality always holds: the primal and dual optimal values $p^*$ and $d^*$ satisfy $p^* \ge d^*$.
Slater's condition: for a convex primal problem, if there exists an $x$ such that $g_1(x) < 0, \dots, g_m(x) < 0$ and $h_1(x) = 0, \dots, h_p(x) = 0$, then strong duality holds: $p^* = d^*$.
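The recap can be checked numerically on a one-dimensional toy problem. The sketch below is an illustration only: the problem (minimize $x^2$ subject to $1 - x \le 0$) and the names `dual_function`, `u_grid`, `p_star`, and `d_star` are assumptions and do not come from the lecture. It evaluates the dual function (called theta in the comments) on a grid of $u \ge 0$, confirms that every dual value is a lower bound on $p^*$, and shows that the best lower bound matches $p^*$, as Slater's condition predicts for this convex problem.

```python
import numpy as np

# Toy problem (an assumption for illustration, not from the lecture):
#   minimize f(x) = x^2   subject to   g(x) = 1 - x <= 0.
# The primal optimum is x* = 1 with value p* = 1.
p_star = 1.0

def dual_function(u):
    """theta(u) = min_x [x^2 + u * (1 - x)]; the inner minimizer is x = u / 2."""
    x = u / 2.0
    return x**2 + u * (1.0 - x)

# Evaluate the dual function on a grid of dual-feasible points u >= 0.
u_grid = np.linspace(0.0, 5.0, 501)
theta_vals = dual_function(u_grid)

# Weak duality: every dual value is a lower bound on p*.
weak_duality_holds = bool(np.all(theta_vals <= p_star + 1e-12))

# Strong duality: the best lower bound on the grid attains p*.
u_best = u_grid[np.argmax(theta_vals)]
d_star = theta_vals.max()

print(weak_duality_holds, u_best, d_star)
```

Here the dual function has the closed form $u - u^2/4$, which is concave in $u$ even before any convexity of the primal problem is used.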

4 Outline
Today:
KKT conditions
Geometric interpretation
Relevant examples in machine learning and other areas

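5 Karush-Kuhn-Tucker Conditions
Given the general problem
\[
\begin{aligned}
\text{Minimize} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m \qquad (\text{dual variable } u_i \ge 0) \\
& h_j(x) = 0, \quad j = 1, \dots, p \qquad (\text{dual variable } v_j \text{ unconstrained})
\end{aligned}
\]
with Lagrangian $L(x, u, v) = f(x) + u^\top g(x) + v^\top h(x)$, the Karush-Kuhn-Tucker (KKT) conditions are:
Stationarity (ST): $\nabla_x f(x) + \sum_{i=1}^{m} u_i \nabla_x g_i(x) + \sum_{j=1}^{p} v_j \nabla_x h_j(x) = 0$
Complementary slackness (CS): $u_i g_i(x) = 0 \ \ \forall i$ (for each $i$, either $u_i = 0$ or $g_i(x) = 0$)
Primal feasibility (PF): $g_i(x) \le 0, \ h_j(x) = 0 \ \ \forall i, j$
Dual feasibility (DF): $u_i \ge 0 \ \ \forall i$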
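The four conditions translate directly into residuals that can be evaluated at a candidate triple $(x, u, v)$. The following is a minimal sketch under stated assumptions: the helper `kkt_residuals`, its argument names, and the toy problem (minimize $x_1^2 + x_2^2$ subject to $-x_1 \le 0$ and $x_1 + x_2 - 1 = 0$) are hypothetical and chosen only for illustration. Each residual is zero exactly when the corresponding condition holds.

```python
import numpy as np

def kkt_residuals(grad_f, g, jac_g, h, jac_h, x, u, v):
    """Return the violation of each KKT condition at (x, u, v).

    g(x), h(x) return 1-D arrays; jac_g(x), jac_h(x) return 2-D Jacobians
    with one row per constraint. All names here are hypothetical helpers.
    """
    gx, hx = g(x), h(x)
    st = grad_f(x) + jac_g(x).T @ u + jac_h(x).T @ v
    return {
        "ST": float(np.linalg.norm(st)),                        # stationarity
        "CS": float(np.max(np.abs(u * gx), initial=0.0)),       # u_i * g_i(x) = 0
        "PF": float(max(np.max(np.maximum(gx, 0.0), initial=0.0),
                        np.max(np.abs(hx), initial=0.0))),      # g <= 0, h = 0
        "DF": float(np.max(np.maximum(-u, 0.0), initial=0.0)),  # u >= 0
    }

# Toy problem (assumed for illustration):
#   minimize x1^2 + x2^2  subject to  -x1 <= 0,  x1 + x2 - 1 = 0.
grad_f = lambda x: 2.0 * x
g = lambda x: np.array([-x[0]])
jac_g = lambda x: np.array([[-1.0, 0.0]])
h = lambda x: np.array([x[0] + x[1] - 1.0])
jac_h = lambda x: np.array([[1.0, 1.0]])

# Candidate KKT point: x = (0.5, 0.5), inactive inequality (u = 0), v = -1.
res_good = kkt_residuals(grad_f, g, jac_g, h, jac_h,
                         np.array([0.5, 0.5]), np.array([0.0]), np.array([-1.0]))

# A feasible point that is not stationary for the same multipliers.
res_bad = kkt_residuals(grad_f, g, jac_g, h, jac_h,
                        np.array([1.0, 0.0]), np.array([0.0]), np.array([-1.0]))

print(res_good)
print(res_bad)
```

In floating-point practice the residuals would be compared against a small tolerance rather than tested for exact zero.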

6 KKT Necessity
Theorem 1. If $x^*$ and $(u^*, v^*)$ are primal and dual solutions with zero duality gap (e.g., implied by convexity and Slater's condition), then $(x^*, u^*, v^*)$ satisfy the KKT conditions.
Proof. PF and DF hold for free from the assumption. Also, $x^*$ and $(u^*, v^*)$ are primal and dual solutions with strong duality, so
\[
f(x^*) = \theta(u^*, v^*) = \min_x \Big[ f(x) + \sum_{i=1}^{m} u_i^* g_i(x) + \sum_{j=1}^{p} v_j^* h_j(x) \Big]
\le f(x^*) + \sum_{i=1}^{m} u_i^* g_i(x^*) + \sum_{j=1}^{p} v_j^* h_j(x^*) \le f(x^*).
\]
The first equality is strong duality, the second is the definition of the dual function, the first inequality holds because a minimum over $x$ is no larger than the value at $x^*$, and the last inequality follows from DF and PF ($u_i^* \ge 0$, $g_i(x^*) \le 0$, $h_j(x^*) = 0$). Since both ends equal $f(x^*)$, all these inequalities are equalities. Then:
(A) $x^*$ minimizes $L(x, u^*, v^*)$ over $x \in \mathbb{R}^n$ (unconstrained), so the gradient of $L(x, u^*, v^*)$ must be 0 at $x^*$, i.e., the stationarity condition.
(B) Since $u_i^* g_i(x^*) \le 0$ for every $i$ (PF and DF), we must have $\sum_{i=1}^{m} u_i^* g_i(x^*) = 0$ with every term equal to zero, i.e., the complementary slackness condition.

7 KKT Sufficiency
Theorem 2. If the primal problem is convex, and $x^*$ and $(u^*, v^*)$ satisfy the KKT conditions, then $x^*$ and $(u^*, v^*)$ are primal and dual optimal solutions, respectively.
Proof. If $x^*$ and $(u^*, v^*)$ satisfy the KKT conditions, then
\[
\theta(u^*, v^*) \overset{(a)}{=} f(x^*) + \sum_{i=1}^{m} u_i^* g_i(x^*) + \sum_{j=1}^{p} v_j^* h_j(x^*) \overset{(b)}{=} f(x^*),
\]
where (a) follows from ST (since $L(x, u^*, v^*)$ is convex in $x$, a stationary point $x^*$ is a minimizer of it) and (b) follows from CS and PF ($u_i^* g_i(x^*) = 0$ and $h_j(x^*) = 0$). Therefore, the duality gap is zero. Note that $x^*$ and $(u^*, v^*)$ are PF and DF. Hence, they are primal and dual optimal, respectively.

8 In Summary
So putting things together:
Theorem 3. For a convex optimization problem with strong duality (e.g., implied by Slater's condition or other constraint qualifications):
\[
x^* \text{ and } (u^*, v^*) \text{ are primal and dual solutions} \iff x^* \text{ and } (u^*, v^*) \text{ satisfy the KKT conditions.}
\]
Warning: This statement is only true for convex optimization problems. For nonconvex optimization problems, KKT conditions are in general neither necessary nor sufficient! (more on this shortly)


9 Where Does This Name Come From?
Older books/papers referred to this as the KT (Kuhn-Tucker) conditions.
First appeared in a publication by Kuhn and Tucker in 1951.
Kuhn & Tucker shared the John von Neumann Theory Prize in 1980.
Later, people realized that Karush had the same conditions in his unpublished master's thesis in 1939.
[Photos: William Karush, Harold W. Kuhn, Albert W. Tucker]
A Fun Read: R. W. Cottle, "William Karush and the KKT Theorem," Documenta Mathematica, 2012.

10 Other Optimality Conditions
KKT conditions are a special case of the more general Fritz John conditions:
\[
u_0 \nabla f(x^*) + \sum_{i=1}^{m} u_i \nabla g_i(x^*) + \sum_{j=1}^{p} v_j \nabla h_j(x^*) = 0,
\]
where $u_0$ could be 0.
In turn, Fritz John conditions (hence KKT) belong to a wider class of first-order necessary conditions (FONC), which allow for nonsmooth functions using subderivatives.
Further, there is a whole class of second-order necessary & sufficient conditions (SONC, SOSC), also in KKT style.
For an excellent treatment on optimality conditions, see [BSS, Ch. 4, Ch. 6].

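11 Geometric Interpretation of KKT
Set of binding (active) constraints at $x^*$: $I(x^*) = \{ i : g_i(x^*) = 0 \}$. By CS, only these constraints can have $u_i^* > 0$.
Physics interpretation: with inequality constraints only, stationarity reads $-\nabla f(x^*) = \sum_{i \in I(x^*)} u_i^* \nabla g_i(x^*)$ with $u_i^* \ge 0$. The pulling force $-\nabla f(x^*)$ is balanced by the forces exerted by the active constraints, so the forces sum to 0.
[Figure: hand-drawn sketch of $-\nabla f(x^*)$ and the active-constraint gradients $\nabla g_i(x^*)$ at $x^*$]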

12 When is KKT neither sufficient nor necessary?
(Not necessary): $x^*$ is a (local) minimum $\not\Rightarrow$ $x^*$ is a KKT point.
Note: such an $x^*$ is still a Fritz John point (with $u_0 = 0$).
(Not sufficient): $x^*$ is a KKT point $\not\Rightarrow$ $x^*$ is a (local) minimum. This can happen when the problem is nonconvex.
[Figures: two hand-drawn counterexamples, one for each case]
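Both failure modes can be reproduced with one-dimensional problems. The two toy problems below are assumptions chosen for illustration and are not the hand-drawn examples from the slide. In case (a), minimize $x$ subject to $x^2 \le 0$, the unique feasible point $x^* = 0$ is optimal but no multiplier satisfies stationarity, because the constraint gradient vanishes there (a constraint qualification fails); the Fritz John conditions still hold with $u_0 = 0$. In case (b), minimize $-x^2$ subject to $-1 \le x \le 1$, the point $x = 0$ satisfies all KKT conditions but is a local maximum of the nonconvex objective.

```python
# Illustrative toy problems (assumptions, not the examples drawn on the slide).

# (a) KKT not necessary: minimize x subject to x^2 <= 0.
# The only feasible point is x* = 0, so it is the global minimum.
x_star = 0.0
grad_f_a = 1.0             # derivative of f(x) = x
grad_g_a = 2.0 * x_star    # derivative of g(x) = x^2, which is 0 at x* = 0

def st_residual_a(u):
    """Stationarity residual grad f + u * grad g evaluated at x* = 0."""
    return grad_f_a + u * grad_g_a

# No multiplier u >= 0 zeroes the residual: it equals 1 for every u.
kkt_fails = all(abs(st_residual_a(u)) > 0.5 for u in (0.0, 1.0, 10.0, 1e6))

# Fritz John still holds with (u0, u) = (0, 1), which is not the zero vector.
u0, u = 0.0, 1.0
fritz_john_residual = u0 * grad_f_a + u * grad_g_a

# (b) KKT not sufficient: minimize -x^2 subject to x - 1 <= 0 and -x - 1 <= 0.
def f_b(x):
    return -x**2

x_kkt, u1, u2 = 0.0, 0.0, 0.0
st_residual_b = -2.0 * x_kkt + u1 * 1.0 + u2 * (-1.0)   # stationarity
cs_b = (u1 * (x_kkt - 1.0), u2 * (-x_kkt - 1.0))        # complementary slackness

# x = 0 passes the KKT checks, yet nearby feasible points have a smaller objective.
is_local_min = all(f_b(x_kkt) <= f_b(x_kkt + d) for d in (-0.1, 0.1))

print(kkt_fails, fritz_john_residual, st_residual_b, is_local_min)
```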

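13 Example 1: Quadratic Problems with Equality Constraints
Consider, for $Q \succeq 0$, the following quadratic programming problem:
\[
\text{Minimize}_x \quad \tfrac{1}{2} x^\top Q x + c^\top x \qquad \text{subject to} \quad A x = 0 \qquad (\text{dual variable } u \text{ unconstrained})
\]
This is a convex problem without inequality constraints. Its Lagrangian is $L(x, u) = \tfrac{1}{2} x^\top Q x + c^\top x + u^\top A x$, so
(ST): $Q x + c + A^\top u = 0$
(PF): $A x = 0$
(DF) and (CS): vacuous, since there are no inequality constraints.
By KKT, $x$ is primal optimal if and only if
\[
\begin{bmatrix} Q & A^\top \\ A & 0 \end{bmatrix}
\begin{bmatrix} x \\ u \end{bmatrix}
=
\begin{bmatrix} -c \\ 0 \end{bmatrix}
\]
for some dual variable $u$. This linear equation system combines ST and PF.
It often arises from using Newton's method to solve equality-constrained problems $\{\min_x f(x) : A x = b\}$. By Taylor's expansion, $f(x + d) \approx f(x) + \nabla f(x)^\top d + \tfrac{1}{2} d^\top \nabla^2 f(x) d$, where $f(x)$ is a constant. Note: if $A x = b$, then $A(x + d) = b$ requires $A d = 0$, so the Newton step $d$ solves a problem of the form above with $Q = \nabla^2 f(x)$ and $c = \nabla f(x)$.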
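Because the KKT conditions of this example reduce to one linear system, a solver is only a few lines. The following is a minimal sketch that assumes the KKT matrix is nonsingular (for instance, $A$ has full row rank and $Q$ is positive definite on the null space of $A$); the function name `solve_eq_qp` and the numerical data are hypothetical.

```python
import numpy as np

def solve_eq_qp(Q, c, A):
    """Solve min 0.5 x^T Q x + c^T x  s.t.  A x = 0 through its KKT system.

    Assumes the KKT matrix [[Q, A^T], [A, 0]] is nonsingular.
    Returns the primal solution x and the dual variable u.
    """
    n, p = Q.shape[0], A.shape[0]
    kkt_matrix = np.block([[Q, A.T], [A, np.zeros((p, p))]])
    rhs = np.concatenate([-c, np.zeros(p)])
    sol = np.linalg.solve(kkt_matrix, rhs)
    return sol[:n], sol[n:]

# Hypothetical data: minimize x1^2 + x2^2 - 2*x1 subject to x1 + x2 = 0.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, 0.0])
A = np.array([[1.0, 1.0]])

x_opt, u_opt = solve_eq_qp(Q, c, A)

# Check stationarity (ST) and primal feasibility (PF) at the solution.
st_residual = Q @ x_opt + c + A.T @ u_opt
pf_residual = A @ x_opt

print(x_opt, u_opt)
```

For this data, substituting $x_2 = -x_1$ by hand gives $x = (0.5, -0.5)$ and $u = 1$, which the linear solve should reproduce up to rounding.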

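14 Example 2: Support Vector Machine
Given labels $y \in \{-1, 1\}^m$ and feature vectors $x_1, \dots, x_m$, let $X = [x_1 \ \cdots \ x_m]^\top$.
Recall from Lecture 1 the support vector machine problem:
\[
\begin{aligned}
\text{Minimize}_{w, b, \xi} \quad & \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, m \qquad (\text{dual variable } u_i \ge 0) \\
& \xi_i \ge 0, \quad i = 1, \dots, m \qquad (\text{dual variable } v_i \ge 0)
\end{aligned}
\]
This is a quadratic program, and Slater's condition holds.
Introducing dual variables $u, v \ge 0$ (DF), the Lagrangian is
\[
L(w, b, \xi, u, v) = \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i + \sum_{i=1}^{m} u_i \big( 1 - \xi_i - y_i (w^\top x_i + b) \big) - \sum_{i=1}^{m} v_i \xi_i .
\]
Taking derivatives with respect to $w$, $b$, and $\xi_i$ gives the KKT system:
(ST): $w = \sum_{i=1}^{m} u_i y_i x_i$, $\quad 0 = \sum_{i=1}^{m} u_i y_i$, $\quad u = C \mathbf{1} - v$
(CS): $v_i \xi_i = 0$, $\quad u_i \big( 1 - \xi_i - y_i (x_i^\top w + b) \big) = 0$, $\quad i = 1, \dots, m$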

15 Example 2: Support Vector Machine
Hence, at optimality we have $w = \sum_{i=1}^{m} u_i y_i x_i$, and $u_i$ is nonzero only if $y_i (x_i^\top w + b) = 1 - \xi_i$. Such points are called the support points.
For support point $i$, if $\xi_i = 0$, then $x_i$ lies on the edge of the margin and $u_i \in (0, C]$.
For support point $i$, if $\xi_i \ne 0$, then $x_i$ lies on the wrong side of the margin and $u_i = C$ (since $v_i = 0$ by CS).
[Figure: separating hyperplane $w^\top x + b = 0$ with margin $2 / \|w\|$, i.e., $1 / \|w\|$ on each side]
KKT conditions do not really give us a way to find the solution here, but they give better understanding and are useful in proofs.
In fact, we can use this to screen away non-support points before performing optimization (lower complexity).
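The SVM conditions can be verified by hand on a dataset small enough to solve on paper. The sketch below is an illustration under stated assumptions: the two-point dataset, the value $C = 1$, and the candidate solution ($w = (1, 0)$, $b = 0$, $\xi = 0$, $u = (0.5, 0.5)$) are all hypothetical. It checks ST, CS, PF, and DF for the soft-margin problem and reads off the support points from the nonzero entries of $u$.

```python
import numpy as np

# Hypothetical two-point dataset and candidate solution (assumed, not from the slides).
X = np.array([[1.0, 0.0], [-1.0, 0.0]])   # rows are feature vectors x_i
y = np.array([1.0, -1.0])                 # labels
C = 1.0

w = np.array([1.0, 0.0])
b = 0.0
xi = np.zeros(2)
u = np.array([0.5, 0.5])   # multipliers of y_i (w^T x_i + b) >= 1 - xi_i
v = C - u                  # multipliers of xi_i >= 0, from ST: u = C*1 - v

margins = y * (X @ w + b)

# Stationarity (ST): w = sum_i u_i y_i x_i and sum_i u_i y_i = 0.
st_w = w - X.T @ (u * y)
st_b = float(u @ y)

# Complementary slackness (CS).
cs_slack = v * xi
cs_margin = u * (1.0 - xi - margins)

# Primal feasibility (PF) and dual feasibility (DF).
pf_ok = bool(np.all(margins >= 1.0 - xi - 1e-12) and np.all(xi >= 0.0))
df_ok = bool(np.all(u >= 0.0) and np.all(v >= 0.0))

# Support points are the indices with nonzero u_i.
support = np.flatnonzero(u > 1e-12)

print(st_w, st_b, cs_slack, cs_margin, pf_ok, df_ok, support)
```

Since the problem is convex, passing all four checks certifies optimality of the candidate. Both points are support points with $\xi_i = 0$ and $u_i \in (0, C]$, so both lie on the edge of the margin.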

16 Example 3: Waterfilling
Example from [BV]: Consider the problem
\[
\begin{aligned}
\text{Minimize}_x \quad & -\sum_{i=1}^{n} \log(\alpha_i + x_i) \\
\text{subject to} \quad & x \ge 0 \qquad (\text{dual variable } u \ge 0) \\
& \mathbf{1}^\top x = 1 \qquad (\text{dual variable } v \text{ unconstrained})
\end{aligned}
\]
In information theory, $\log(\alpha_i + x_i)$ is the communication rate of the $i$th channel.
Introducing dual variables $u \ge 0$ and $v$ (unconstrained), the Lagrangian is $L(x, u, v) = -\sum_{i=1}^{n} \log(\alpha_i + x_i) - \sum_{i=1}^{n} u_i x_i + v \big( \sum_{i=1}^{n} x_i - 1 \big)$. Setting $\partial L / \partial x_i = 0$ gives the KKT system:
(ST): $-1 / (\alpha_i + x_i) - u_i + v = 0, \quad i = 1, \dots, n$
(CS): $u_i x_i = 0, \quad i = 1, \dots, n$
(PF): $x \ge 0, \quad \mathbf{1}^\top x = 1$
(DF): $u \ge 0$
Eliminating $u$ yields:
\[
1 / (\alpha_i + x_i) \le v, \qquad x_i \big( v - 1 / (\alpha_i + x_i) \big) = 0, \quad i = 1, \dots, n, \qquad x \ge 0, \quad \mathbf{1}^\top x = 1 .
\]

17 Example 3: Waterfilling
ST and CS imply that
\[
x_i = \begin{cases} 1/v - \alpha_i & \text{if } v < 1/\alpha_i \\ 0 & \text{if } v \ge 1/\alpha_i \end{cases}
\quad \Longrightarrow \quad x_i = \max\{0, \ 1/v - \alpha_i\}, \quad i = 1, \dots, n .
\]
Also, from PF, i.e., $\mathbf{1}^\top x = 1$, we have
\[
\sum_{i=1}^{n} \max\{0, \ 1/v - \alpha_i\} = 1 .
\]
This is a univariate equation, piecewise linear in $1/v$, and not hard to solve.
This reduced problem is referred to as the waterfilling solution (from [BV], p. 246).
[Figure: waterfilling diagram, with water level $1/v$ above ground heights $\alpha_i$ and allocation $x_i$ equal to the water depth over patch $i$]
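Because the left-hand side of the univariate equation is nondecreasing in the water level $1/v$, bisection is one simple way to solve it. The sketch below is an illustration only: the function name `waterfill`, the `total` parameter (the slides fix the budget at 1), the tolerance, and the sample values of $\alpha$ are assumptions.

```python
import numpy as np

def waterfill(alpha, total=1.0, tol=1e-12):
    """Find the water level 1/v with sum_i max(0, level - alpha_i) = total.

    Bisection sketch: the filled amount is nondecreasing in the level,
    is 0 at level = min(alpha), and is >= total at level = min(alpha) + total.
    Returns the allocation x and the water level.
    """
    alpha = np.asarray(alpha, dtype=float)
    lo, hi = alpha.min(), alpha.min() + total
    for _ in range(200):
        level = 0.5 * (lo + hi)
        if np.maximum(0.0, level - alpha).sum() > total:
            hi = level   # too much water: lower the level
        else:
            lo = level   # not enough water: raise the level
        if hi - lo < tol:
            break
    level = 0.5 * (lo + hi)
    return np.maximum(0.0, level - alpha), level

# Hypothetical channel parameters.
alpha = np.array([0.5, 1.0, 2.0])
x, level = waterfill(alpha)
v = 1.0 / level

print(x, level, v)
```

For these values, only the first two channels receive water: solving $(t - 0.5) + (t - 1) = 1$ by hand gives level $t = 1.25$, so $x = (0.75, 0.25, 0)$ and $v = 0.8$. The third channel satisfies $1/(\alpha_3 + x_3) = 0.5 \le v$, consistent with the eliminated KKT system.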

18 Next Class
Gradient Descent
