COM S 672: Advanced Topics in Computational Models of Learning Optimization for Learning


COM S 672: Advanced Topics in Computational Models of Learning
Optimization for Learning
Lecture Note 4: Optimality Conditions

Jia (Kevin) Liu
Assistant Professor
Department of Computer Science
Iowa State University, Ames, Iowa, USA
Fall 2017

JKL (CS@ISU) COM S 672: Lecture 4 1/20

Recap: Last Lecture

Given a minimization problem

  Minimize $f(x)$
  subject to $g_i(x) \le 0, \; i = 1, \dots, m$   (dual variables $u_i \ge 0$)
             $h_j(x) = 0, \; j = 1, \dots, p$   (dual variables $v_j$ unconstrained)

we define the Lagrangian

  $L(x, u, v) = f(x) + \sum_{i=1}^{m} u_i g_i(x) + \sum_{j=1}^{p} v_j h_j(x)$

and the Lagrangian dual function

  $\theta(u, v) = \min_x L(x, u, v)$

JKL (CS@ISU) COM S 672: Lecture 4 2/20

Recap: Last Lecture

The corresponding Lagrangian dual problem is

  Maximize $\theta(u, v)$ subject to $u \ge 0$

Important properties:
- The dual problem is always convex ($\theta$ is always concave), even if the primal problem is non-convex.
- Weak duality always holds: the primal and dual optimal values $p^*$ and $d^*$ satisfy $p^* \ge d^*$.
- Slater's condition: for a convex primal, if there exists an $x$ such that $g_1(x) < 0, \dots, g_m(x) < 0$ and $h_1(x) = 0, \dots, h_p(x) = 0$, then strong duality holds: $p^* = d^*$.

JKL (CS@ISU) COM S 672: Lecture 4 3/20
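To make the recap concrete, here is a minimal numerical sketch (my own toy example, not from the slides): for $\min_x x^2$ subject to $1 - x \le 0$, the dual function is $\theta(u) = \min_x \, x^2 + u(1 - x) = u - u^2/4$, and maximizing it over $u \ge 0$ gives $d^* = 1 = p^*$, since Slater's condition holds.

```python
import numpy as np

# Toy problem (illustration only, not from the slides):
#   minimize  f(x) = x^2   subject to  g(x) = 1 - x <= 0
# Lagrangian: L(x, u) = x^2 + u*(1 - x), minimized over x at x = u/2,
# giving the dual function theta(u) = u - u^2/4.

def theta(u):
    x_min = u / 2.0                      # unconstrained minimizer of L(., u)
    return x_min**2 + u * (1.0 - x_min)  # = u - u^2/4

u_grid = np.linspace(0.0, 4.0, 401)
d_star = max(theta(u) for u in u_grid)   # dual optimal value, attained at u = 2
p_star = 1.0                             # primal optimum at x* = 1

print(f"p* = {p_star}, d* = {d_star:.4f}")  # weak duality: d* <= p*; here d* = p*
```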

Outline

Today:
- KKT conditions
- Geometric interpretation
- Relevant examples in machine learning and other areas

JKL (CS@ISU) COM S 672: Lecture 4 4/20

Karush-Kuhn-Tucker Conditions

Given a general problem

  Minimize $f(x)$
  subject to $g_i(x) \le 0, \; i = 1, \dots, m$   (dual variables $u_i \ge 0$)
             $h_j(x) = 0, \; j = 1, \dots, p$   (dual variables $v_j$ unconstrained)

the Karush-Kuhn-Tucker (KKT) conditions are:
- Stationarity (ST): $\nabla_x f(x) + \sum_{i=1}^{m} u_i \nabla_x g_i(x) + \sum_{j=1}^{p} v_j \nabla_x h_j(x) = 0$
- Complementary slackness (CS): $u_i g_i(x) = 0, \; \forall i$ (for any given $i$, either $u_i = 0$ or $g_i(x) = 0$)
- Primal feasibility (PF): $g_i(x) \le 0, \; h_j(x) = 0, \; \forall i, j$
- Dual feasibility (DF): $u_i \ge 0, \; \forall i$

JKL (CS@ISU) COM S 672: Lecture 4 5/20
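Continuing the toy example above ($\min_x x^2$ s.t. $1 - x \le 0$, with $x^* = 1$, $u^* = 2$), here is a short sketch that checks each of the four conditions numerically; the problem and numbers are illustrative only.

```python
import numpy as np

# Verify the KKT conditions at (x*, u*) = (1, 2) for
#   minimize x^2  subject to  g(x) = 1 - x <= 0   (illustration only)
x_star, u_star = 1.0, 2.0

grad_f = 2 * x_star        # d/dx x^2
grad_g = -1.0              # d/dx (1 - x)
g_val = 1.0 - x_star

stationarity = np.isclose(grad_f + u_star * grad_g, 0.0)  # (ST)
comp_slack = np.isclose(u_star * g_val, 0.0)              # (CS)
primal_feas = g_val <= 1e-12                              # (PF)
dual_feas = u_star >= 0.0                                 # (DF)

print(stationarity, comp_slack, primal_feas, dual_feas)   # all True
```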

KKT Necessity

Theorem 1: If $x^*$ and $(u^*, v^*)$ are primal and dual solutions with zero duality gap (e.g., implied by Slater's condition), then $(x^*, u^*, v^*)$ satisfies the KKT conditions.

Proof: We have PF and DF for free from the assumption. Also, since $x^*$ and $(u^*, v^*)$ are primal and dual solutions with strong duality,

  $f(x^*) = \theta(u^*, v^*) = \min_x \left[ f(x) + \sum_{i=1}^{m} u_i^* g_i(x) + \sum_{j=1}^{p} v_j^* h_j(x) \right] \le f(x^*) + \sum_{i=1}^{m} u_i^* g_i(x^*) + \sum_{j=1}^{p} v_j^* h_j(x^*) \le f(x^*)$

That is, all these inequalities are equalities. Then:
- $x^*$ minimizes $L(x, u^*, v^*)$ over $x \in \mathbb{R}^n$ (unconstrained), so the gradient of $L(x, u^*, v^*)$ must be $0$ at $x^*$, i.e., the stationarity condition holds.
- Since $u_i^* g_i(x^*) \le 0$ for each $i$ (PF & DF) and the inequalities above are tight, we must have $\sum_{i=1}^{m} u_i^* g_i(x^*) = 0$, hence $u_i^* g_i(x^*) = 0$ for every $i$, i.e., the complementary slackness condition holds.

JKL (CS@ISU) COM S 672: Lecture 4 6/20

KKT Sufficiency

Theorem 2: If $x^*$ and $(u^*, v^*)$ satisfy the KKT conditions, then $x^*$ and $(u^*, v^*)$ are primal and dual optimal solutions, respectively.

Proof: If $x^*$ and $(u^*, v^*)$ satisfy the KKT conditions, then

  $\theta(u^*, v^*) \overset{(a)}{=} f(x^*) + \sum_{i=1}^{m} u_i^* g_i(x^*) + \sum_{j=1}^{p} v_j^* h_j(x^*) \overset{(b)}{=} f(x^*)$,

where (a) follows from ST ($\nabla_x L(x^*, u^*, v^*) = 0$, so $x^*$ is a minimizer of $L(\cdot, u^*, v^*)$) and (b) follows from CS and PF. Therefore, the duality gap is zero. Note that $x^*$ and $(u^*, v^*)$ are PF and DF. Hence, they are primal and dual optimal, respectively.

JKL (CS@ISU) COM S 672: Lecture 4 7/20

In Summary

Putting things together:

Theorem 3: For a convex optimization problem with strong duality (e.g., implied by Slater's condition or other constraint qualifications): $x^*$ and $(u^*, v^*)$ are primal and dual solutions $\iff$ $x^*$ and $(u^*, v^*)$ satisfy the KKT conditions.

Warning: This statement is only true for convex optimization problems. For non-convex optimization problems, KKT conditions are neither necessary nor sufficient! (more on this shortly)

JKL (CS@ISU) COM S 672: Lecture 4 8/20

Where Does this Name Come From?

- Older books/papers referred to these as the KT (Kuhn-Tucker) conditions. They first appeared in a publication by Kuhn and Tucker in 1951; Kuhn & Tucker shared the John von Neumann Theory Prize in 1980.
- Later, people realized that Karush had derived the same conditions in his unpublished master's thesis in 1939.

(Photos: William Karush, Harold W. Kuhn, Albert W. Tucker)

A fun read: R. W. Cottle, "William Karush and the KKT Theorem," Documenta Mathematica, 2012, pp. 255-269.

JKL (CS@ISU) COM S 672: Lecture 4 9/20

Other Optimality Conditions

KKT conditions are a special case of the more general Fritz John conditions:

  $u_0 \nabla f(x^*) + \sum_{i=1}^{m} u_i \nabla g_i(x^*) + \sum_{j=1}^{p} v_j \nabla h_j(x^*) = 0$,

where $u_0$ could be $0$.

In turn, the Fritz John conditions (hence KKT) belong to a wider class of first-order necessary conditions (FONC), which allow for non-smooth functions using subderivatives.

Further, there is a whole class of second-order necessary & sufficient conditions (SONC, SOSC), also in KKT style.

For an excellent treatment of optimality conditions, see [BSS, Ch. 4-Ch. 6].

JKL (CS@ISU) COM S 672: Lecture 4 10/20

Geometric Interpretation of KKT

(Hand-drawn figure: contours of the objective, the feasible region, and the gradients of the binding (active/tight) constraints at $x^*$.)

Key picture: $-\nabla f(x^*)$, the descent direction of the objective, is contained in the cone generated by the normal vectors $\nabla g_i(x^*)$ of the binding constraints at $x^*$, where $I(x^*) = \{ i : g_i(x^*) = 0 \}$; equivalently, $\nabla f(x^*) + \sum_{i \in I(x^*)} u_i \nabla g_i(x^*) = 0$ for some $u_i \ge 0$.

Physics interpretation: the "pulling force" of the objective is balanced by the "repelling forces" of the binding constraints.

JKL (CS@ISU) COM S 672: Lecture 4 11/20

When is KKT Neither Sufficient Nor Necessary?

(Not necessary): $x^*$ is a (local) minimum $\not\Rightarrow$ $x^*$ is a KKT point. This can happen when a constraint $g$ is non-convex and the constraint qualification fails: no $u_i \ge 0$ satisfy stationarity at $x^*$, and in the Fritz John conditions one must take $u_0 = 0$.

(Not sufficient): $x^*$ is a KKT point $\not\Rightarrow$ $x^*$ is a (local) minimum. With non-convex constraints, there may exist $u_1, u_2 \ge 0$ such that $\nabla f(x^*) + u_1 \nabla g_1(x^*) + u_2 \nabla g_2(x^*) = 0$, yet $x^*$ is not optimal.

(Hand-drawn counterexample sketches on the slide.)

JKL (CS@ISU) COM S 672: Lecture 4 12/20
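A generic worked instance of the "not sufficient" direction (my own example, not the one sketched on the slide): consider $\min_x -x^2$ subject to $g_1(x) = x - 1 \le 0$ and $g_2(x) = -x - 1 \le 0$. At $x = 0$, stationarity holds with $u_1 = u_2 = 0$ (since $-2x = 0$), and CS, PF, DF hold trivially, so $x = 0$ is a KKT point; yet it maximizes the objective over the feasible set $[-1, 1]$, and the true minimizers are $x = \pm 1$.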

KKT Example 1: Quadratic Problems with Equality Constraints

Consider, for $Q \succeq 0$, the quadratic programming problem

  Minimize$_x$ $\tfrac{1}{2} x^\top Q x + c^\top x$
  subject to $A x = 0$

A convex problem with no inequality constraints. By the KKT conditions, $x$ is primal optimal if and only if

  $\begin{bmatrix} Q & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix}$

for some dual variable $u$: a linear equation system combining ST (first block row) and PF (second block row); CS and DF are vacuous here.

This structure often arises when using Newton's method to solve equality-constrained problems $\min_x \{ f(x) : A x = b \}$: by a Taylor expansion around the current iterate $x^*$ (which satisfies $A x^* = b$), $f(x) \approx f(x^*) + \nabla f(x^*)^\top (x - x^*) + \tfrac{1}{2} (x - x^*)^\top \nabla^2 f(x^*) (x - x^*)$, and we want the step $\Delta x = x - x^*$ to satisfy $A \Delta x = 0$, which yields a KKT system of exactly this form.

JKL (CS@ISU) COM S 672: Lecture 4 13/20
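A minimal NumPy sketch of solving this KKT system directly; the matrices $Q$, $c$, $A$ below are made-up illustrative data, with $Q \succ 0$ and $A$ of full row rank so that the system is nonsingular.

```python
import numpy as np

# Equality-constrained QP (illustrative data):
#   minimize (1/2) x^T Q x + c^T x   subject to  A x = 0
Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
c = np.array([1.0, -2.0])
A = np.array([[1.0, 1.0]])               # one equality constraint
n, p = Q.shape[0], A.shape[0]

# KKT system:  [Q  A^T] [x]   [-c]
#              [A   0 ] [u] = [ 0 ]
K = np.block([[Q, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-c, np.zeros(p)])
sol = np.linalg.solve(K, rhs)
x, u = sol[:n], sol[n:]

print("ST residual:", Q @ x + c + A.T @ u)   # ~0  (stationarity)
print("PF residual:", A @ x)                 # ~0  (primal feasibility)
```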

Example 2: Support Vector Machine

Given labels $y_i \in \{-1, 1\}$ and feature vectors $x_i$, $i = 1, \dots, m$, let $X = [x_1, \dots, x_m]^\top$. Recall from Lecture 1 the support vector machine problem:

  Minimize$_{w, b, \xi}$ $\tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$
  subject to $y_i (w^\top x_i + b) \ge 1 - \xi_i, \; i = 1, \dots, m$   (dual variables $u_i \ge 0$)
             $\xi_i \ge 0, \; i = 1, \dots, m$   (dual variables $v_i \ge 0$)

Introducing dual variables $u, v \ge 0$, we obtain the KKT system:

  (ST): $0 = \sum_{i=1}^{m} u_i y_i$, $\;\; w = \sum_{i=1}^{m} u_i y_i x_i$, $\;\; u = C \mathbf{1} - v$
  (CS): $v_i \xi_i = 0$, $\;\; u_i \left[ 1 - \xi_i - y_i (x_i^\top w + b) \right] = 0, \; i = 1, \dots, m$

JKL (CS@ISU) COM S 672: Lecture 4 14/20

Example 2: Support Vector Machine

Hence, at optimality, we have $w = \sum_{i=1}^{m} u_i y_i x_i$, and $u_i$ is nonzero only if $y_i (x_i^\top w + b) = 1 - \xi_i$. Such points are called the support points.

- For support point $i$, if $\xi_i = 0$, then $x_i$ lies on the edge of the margin, and $u_i \in (0, C]$.
- For support point $i$, if $\xi_i \neq 0$, then $x_i$ lies on the wrong side of the margin, and $u_i = C$.

(Figure: separating hyperplane $w^\top x + b = 0$ with margin width $1/\|w\|$ on each side and labeled support points.)

KKT conditions do not really give us a way to find the solution here, but they give a better understanding and are useful in proofs. In fact, we can use this to screen away non-support points before performing optimization (lower complexity).

JKL (CS@ISU) COM S 672: Lecture 4 15/20
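A short sketch (my own illustration on synthetic data, using scikit-learn rather than anything from the slides) that fits a linear SVM and checks the stationarity condition $w = \sum_i u_i y_i x_i$; the nonzero dual variables identify the support points, and comparing $u_i$ with $C$ separates on-margin points from margin violators.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (illustration only, not from the lecture)
X, y = make_blobs(n_samples=60, centers=2, random_state=0)
y = 2 * y - 1                                   # relabel to {-1, +1}

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds u_i * y_i for the support points only
w_from_dual = clf.dual_coef_ @ X[clf.support_]  # = sum_i u_i y_i x_i
print(np.allclose(w_from_dual, clf.coef_))      # stationarity holds: True

u = np.abs(clf.dual_coef_).ravel()              # u_i for the support points
print("support points with 0 < u_i < C (xi_i = 0, on the margin):",
      int(np.sum(u < C - 1e-8)))
print("support points with u_i = C (typically margin violators):",
      int(np.sum(np.isclose(u, C))))
```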

Example 3: Water-filling

Example from [BV]: Consider the problem

  Minimize$_x$ $-\sum_{i=1}^{n} \log(\alpha_i + x_i)$
  subject to $x \ge 0, \; \mathbf{1}^\top x = 1$

In information theory, $\log(\alpha_i + x_i)$ is the communication rate of the $i$th channel, where $x_i$ is the transmitter power allocated to channel $i$ under a (normalized) total power constraint.

Introducing dual variables $u \ge 0$ and $v$, we obtain the KKT system:

  (ST): $-1/(\alpha_i + x_i) - u_i + v = 0, \; i = 1, \dots, n$
  (CS): $u_i x_i = 0, \; i = 1, \dots, n$
  (PF): $x \ge 0, \; \mathbf{1}^\top x = 1$
  (DF): $u \ge 0$, $v$ unconstrained

Eliminating $u$ (i.e., $u_i = v - 1/(\alpha_i + x_i) \ge 0$) yields:

  $1/(\alpha_i + x_i) \le v, \; i = 1, \dots, n$
  $x_i \left( v - 1/(\alpha_i + x_i) \right) = 0, \; i = 1, \dots, n$
  $x \ge 0, \; \mathbf{1}^\top x = 1$

JKL (CS@ISU) COM S 672: Lecture 4 16/20

Example 3: Water-filling

ST and CS imply that

  $x_i = \begin{cases} 1/v - \alpha_i & \text{if } v < 1/\alpha_i \\ 0 & \text{if } v \ge 1/\alpha_i \end{cases} \;\;\Longrightarrow\;\; x_i = \max\{0, \, 1/v - \alpha_i\}, \; i = 1, \dots, n$

(If $x_i > 0$, then $u_i = 0$ by CS, so ST gives $v = 1/(\alpha_i + x_i)$, i.e., $x_i = 1/v - \alpha_i$.)

Also, from PF, i.e., $\mathbf{1}^\top x = 1$, we have

  $\sum_{i=1}^{n} \max\{0, \, 1/v - \alpha_i\} = 1$,

a univariate equation, piecewise linear in $1/v$ and not hard to solve. This reduced problem is referred to as the water-filling solution. (Figure from [BV], p. 246: patches of ground at heights $\alpha_i$ flooded to a common water level $1/v$.)

JKL (CS@ISU) COM S 672: Lecture 4 17/20
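A small sketch (my own) of solving the reduced univariate equation by bisection on the water level $w = 1/v$: find $w$ with $\sum_i \max\{0, w - \alpha_i\} = 1$, then set $x_i = \max\{0, w - \alpha_i\}$. The $\alpha_i$ values are illustrative. Bisection works here because the left-hand side is nondecreasing and piecewise linear in $w$, so any bracketing method converges quickly.

```python
import numpy as np

def water_filling(alpha, total=1.0, tol=1e-10):
    """Solve min -sum log(alpha_i + x_i) s.t. x >= 0, 1^T x = total
    by bisection on the water level w = 1/v."""
    alpha = np.asarray(alpha, dtype=float)
    lo, hi = alpha.min(), alpha.max() + total      # bracket for the level w
    while hi - lo > tol:
        w = 0.5 * (lo + hi)
        if np.maximum(0.0, w - alpha).sum() > total:
            hi = w                                  # level too high
        else:
            lo = w                                  # level too low
    return np.maximum(0.0, 0.5 * (lo + hi) - alpha)

alpha = np.array([0.8, 1.0, 1.2, 2.0])              # illustrative channel levels
x = water_filling(alpha)
print(x, x.sum())                                   # x >= 0 and sums to 1
```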

- ' " fall Constrained and Lagrange Forms sut E E Often in ML and Stats, we ll switch back and forth between constrained form, where t 2 R is a tuning parameter and Lagrange form, where u (C): min f(x) subject to g(x) apple t age # x 0 is a tuning parameter (L): min x f(x)+u g(x) and claim these are equivalent Is this true (assuming f and g convex)? Proof (C) to (L): If Problem (C) is strictly feasible, then strong duality holds (why?), and there exists some u 0 (dual solution) such that any solution x in (C) minimizes # + ut txitkugc Clearly, x is also a solution in (L) f(x)+u g(x) t sparsity Regularized Optimization :C spoiler) * Tikhono regulation : mzhfcz ) ttycx ) * Morrow a " " ' : min 411 ) Slater 's JKL (CS@ISU) COM S 672: Lecture 4 18 / 20

Constrained and Lagrange Forms

(L) to (C): If $x^*$ is a solution in (L), then the KKT conditions for (C) are satisfied by taking $t = g(x^*)$, so $x^*$ is a solution in (C).

Putting things together:

  $\bigcup_{u \ge 0} \{ \text{solutions in (L)} \} \subseteq \bigcup_{t} \{ \text{solutions in (C)} \}$
  $\bigcup_{u \ge 0} \{ \text{solutions in (L)} \} \supseteq \bigcup_{t \text{ such that (C) is strictly feasible}} \{ \text{solutions in (C)} \}$

I.e., nearly perfect equivalence.

Note: If the only value of $t$ that leads to a feasible but not strictly feasible constraint set is $t = 0$, then we do get perfect equivalence. So, e.g., if $g \ge 0$ and (C) and (L) are feasible for all $t, u \ge 0$, then we do get perfect equivalence.

JKL (CS@ISU) COM S 672: Lecture 4 19/20
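A tiny numerical check of the correspondence on a hypothetical 1-D example (mine, not from the slides): with $f(x) = (x - 3)^2$ and $g(x) = |x|$, the constrained solution for a budget $t < 3$ is $x_C^* = t$, its KKT dual variable is $u = 2(3 - t)$, and plugging that $u$ into the Lagrange form (solved by soft-thresholding) recovers the same point.

```python
# f(x) = (x - 3)^2,  g(x) = |x|   (illustrative 1-D example)
# (C): min f(x) s.t. |x| <= t   -> x*_C = min(3, t)          for t >= 0
# (L): min f(x) + u * |x|       -> x*_L = max(3 - u/2, 0)    (soft-thresholding)

for t in [0.5, 1.0, 2.0]:
    x_c = min(3.0, t)              # constrained solution
    u = 2.0 * (3.0 - t)            # dual variable at the constrained optimum
    x_l = max(3.0 - u / 2.0, 0.0)  # Lagrange-form solution with that u
    print(t, x_c, x_l)             # x_c == x_l in each case
```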

Next Class

Gradient Descent

JKL (CS@ISU) COM S 672: Lecture 4 20/20