COM S 578X: Optimization for Machine Learning


COM S 578X: Optimization for Machine Learning
Lecture Note 5: Optimality Conditions
Jia (Kevin) Liu, Assistant Professor
Department of Computer Science, Iowa State University, Ames, Iowa, USA
Fall 2018

Recap: Last Lecture
Given a minimization problem
    minimize   $f(x)$
    subject to $g_i(x) \le 0$,  $i = 1, \ldots, m$   (dual variables $u_i \ge 0$)
               $h_j(x) = 0$,    $j = 1, \ldots, p$   (dual variables $v_j$ unconstrained)
we define the Lagrangian
    $L(x, u, v) = f(x) + \sum_{i=1}^m u_i g_i(x) + \sum_{j=1}^p v_j h_j(x)$
and the Lagrangian dual function
    $\Theta(u, v) = \min_x L(x, u, v)$.

Recap: Last Lecture
The subsequent Lagrangian dual problem is
    maximize   $\Theta(u, v)$
    subject to $u \ge 0$.
Important properties:
- The dual problem is always convex (i.e., $\Theta$ is always concave), even if the primal problem is nonconvex.
- Weak duality always holds, i.e., the primal and dual optimal values $p^*$ and $d^*$ satisfy $p^* \ge d^*$.
- Slater's condition: for a convex primal, if there exists $x$ such that $g_1(x) < 0, \ldots, g_m(x) < 0$ and $h_1(x) = 0, \ldots, h_p(x) = 0$, then strong duality holds: $p^* = d^*$.
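For completeness, the standard one-line argument behind the weak duality claim (in the notation above): for any $u \ge 0$, any $v$, and any primal-feasible $\bar{x}$,
    $\Theta(u, v) = \min_x L(x, u, v) \le L(\bar{x}, u, v) = f(\bar{x}) + \sum_{i=1}^m u_i g_i(\bar{x}) + \sum_{j=1}^p v_j h_j(\bar{x}) \le f(\bar{x})$,
since $u_i \ge 0$, $g_i(\bar{x}) \le 0$, and $h_j(\bar{x}) = 0$. Maximizing the left-hand side over $(u, v)$ and minimizing the right-hand side over feasible $\bar{x}$ gives $d^* \le p^*$.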

Outline
Today:
- KKT conditions
- Geometric interpretation
- Relevant examples in machine learning and other areas

Karush-Kuhn-Tucker Conditions
Given the general problem
    minimize   $f(x)$
    subject to $g_i(x) \le 0$,  $i = 1, \ldots, m$   (dual variables $u_i \ge 0$)
               $h_j(x) = 0$,    $j = 1, \ldots, p$   (dual variables $v_j$ unconstrained)
the Karush-Kuhn-Tucker (KKT) conditions are:
- Stationarity (ST): $\nabla_x f(x) + \sum_{i=1}^m u_i \nabla_x g_i(x) + \sum_{j=1}^p v_j \nabla_x h_j(x) = 0$
- Complementary slackness (CS): $u_i g_i(x) = 0$ for all $i$ (i.e., either $u_i = 0$ or $g_i(x) = 0$)
- Primal feasibility (PF): $g_i(x) \le 0$, $h_j(x) = 0$ for all $i, j$
- Dual feasibility (DF): $u_i \ge 0$ for all $i$
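As a concrete sanity check, here is a minimal sketch that verifies the four conditions numerically for the toy problem minimize $x^2$ subject to $1 - x \le 0$ (solution $x^* = 1$, multiplier $u^* = 2$); the function names and tolerance are illustrative, not part of the lecture.

    # Toy problem: minimize f(x) = x^2  subject to  g(x) = 1 - x <= 0.
    # Analytic solution: x* = 1, u* = 2 (from stationarity 2x - u = 0 at x = 1).

    def grad_f(x): return 2.0 * x      # gradient of the objective
    def g(x):      return 1.0 - x      # inequality constraint, g(x) <= 0
    def grad_g(x): return -1.0         # gradient of the constraint

    def check_kkt(x, u, tol=1e-8):
        """Return one boolean per KKT condition at the point (x, u)."""
        return {
            "ST": abs(grad_f(x) + u * grad_g(x)) <= tol,   # stationarity
            "CS": abs(u * g(x)) <= tol,                    # complementary slackness
            "PF": g(x) <= tol,                             # primal feasibility
            "DF": u >= -tol,                               # dual feasibility
        }

    print(check_kkt(x=1.0, u=2.0))   # all True: (1, 2) is a KKT point
    print(check_kkt(x=2.0, u=0.0))   # ST fails: x = 2 is feasible but not optimal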

KKT Necessity
Theorem 1: If $x^*$ and $(u^*, v^*)$ are primal and dual solutions with zero duality gap (e.g., implied by convexity and Slater's condition), then $(x^*, u^*, v^*)$ satisfy the KKT conditions.
Proof: We have PF and DF for free from the assumption. Also, since $x^*$ and $(u^*, v^*)$ are primal and dual solutions with strong duality,
    $f(x^*) = \Theta(u^*, v^*) = \min_x \Big[ f(x) + \sum_{i=1}^m u_i^* g_i(x) + \sum_{j=1}^p v_j^* h_j(x) \Big]$
    $\le f(x^*) + \sum_{i=1}^m u_i^* g_i(x^*) + \sum_{j=1}^p v_j^* h_j(x^*)$   (taking $x = x^*$ instead of the minimum)
    $\le f(x^*)$   (by DF and PF: $u_i^* \ge 0$, $g_i(x^*) \le 0$, $h_j(x^*) = 0$).
That is, all these inequalities are equalities. Then:
- $x^*$ minimizes $L(x, u^*, v^*)$ over $x \in \mathbb{R}^n$ (unconstrained), so the gradient of $L(x, u^*, v^*)$ must be 0 at $x^*$, i.e., the stationarity condition.
- $\sum_{i=1}^m u_i^* g_i(x^*) = 0$ while each term $u_i^* g_i(x^*) \le 0$ (PF & DF), so $u_i^* g_i(x^*) = 0$ for every $i$, i.e., the complementary slackness condition. $\square$

KKT Sufficiency
Theorem 2: If the primal problem is convex and $x^*$ and $(u^*, v^*)$ satisfy the KKT conditions, then $x^*$ and $(u^*, v^*)$ are primal and dual optimal solutions, respectively.
Proof: If $x^*$ and $(u^*, v^*)$ satisfy the KKT conditions, then
    $\Theta(u^*, v^*) \overset{(a)}{=} f(x^*) + \sum_{i=1}^m u_i^* g_i(x^*) + \sum_{j=1}^p v_j^* h_j(x^*) \overset{(b)}{=} f(x^*)$,
where (a) follows from ST ($x^*$ minimizes the convex function $L(x, u^*, v^*)$ because its gradient vanishes at $x^*$) and (b) follows from CS (and PF for the equality constraints). Therefore the duality gap is zero. Note that $x^*$ and $(u^*, v^*)$ are PF and DF; hence they are primal and dual optimal, respectively. $\square$

In Summary
Putting things together:
Theorem 3: For a convex optimization problem with strong duality (e.g., implied by Slater's condition or other constraint qualifications):
    $x^*$ and $(u^*, v^*)$ are primal and dual solutions $\iff$ $x^*$ and $(u^*, v^*)$ satisfy the KKT conditions.
Warning: This statement is only true for convex optimization problems. For nonconvex optimization problems, the KKT conditions are neither necessary nor sufficient! (more on this shortly)

Where Does This Name Come From?
- Older books/papers referred to these as the KT (Kuhn-Tucker) conditions; they first appeared in a publication by Kuhn and Tucker in 1951.
- Kuhn & Tucker shared the John von Neumann Theory Prize in 1980.
- Later, people realized that Karush had derived the same conditions in his unpublished master's thesis in 1939.
[Photos: William Karush, Harold W. Kuhn, Albert W. Tucker]
A fun read: R. W. Cottle, "William Karush and the KKT Theorem," Documenta Mathematica, 2012, pp. 255-269.

Other Optimality Conditions
The KKT conditions are a special case of the more general Fritz John conditions:
    $u_0 \nabla f(x^*) + \sum_{i=1}^m u_i \nabla g_i(x^*) + \sum_{j=1}^p v_j \nabla h_j(x^*) = 0$,
where $u_0$ could be 0.
In turn, the Fritz John conditions (and hence KKT) belong to a wider class of first-order necessary conditions (FONC), which allow for nonsmooth functions using subderivatives.
Further, there is a whole class of second-order necessary & sufficient conditions (SONC/SOSC), also in KKT style.
For an excellent treatment of optimality conditions, see [BSS, Ch. 4 & 6].

Geometric Interpretation of KKT
[Hand-drawn figure: the set of binding (active) constraints at $x^*$, $I(x^*) = \{ i : g_i(x^*) = 0 \}$, together with a "pulling force" interpretation of complementary slackness.]
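In symbols, the standard reading of this picture (reconstructed here since the handwritten annotations are only partly legible, and ignoring equality constraints): with the active set $I(x^*) = \{ i : g_i(x^*) = 0 \}$, CS forces $u_i = 0$ for every inactive constraint, so stationarity becomes
    $-\nabla f(x^*) = \sum_{i \in I(x^*)} u_i \nabla g_i(x^*)$,  with $u_i \ge 0$,
i.e., at an optimum the negative objective gradient lies in the cone generated by the gradients of the active constraints, each of which "pulls" against the objective.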

When is KKT Neither Sufficient nor Necessary?
- (Not necessary): $x^*$ is a (local) minimum does NOT imply that $x^*$ is a KKT point. [Hand-drawn example: a feasible local optimum at which stationarity cannot be satisfied by any multipliers; note that such a point is still a Fritz John point, with $u_0 = 0$.]
- (Not sufficient): $x^*$ is a KKT point does NOT imply that $x^*$ is a (local) minimum. [Hand-drawn example: a nonconvex problem whose KKT point is not a local minimum.]
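Two standard textbook counterexamples make the same two points (these are illustrative stand-ins, not necessarily the examples drawn on the slide):
- Not necessary: minimize $x$ subject to $x^2 \le 0$. The only feasible point is $x^* = 0$, hence it is the global minimum, but stationarity $1 + u \cdot 2x^* = 1 \ne 0$ has no solution, so $x^*$ is not a KKT point (Slater's condition fails). It is a Fritz John point with $u_0 = 0$, $u_1 = 1$.
- Not sufficient: minimize $x^3$ subject to $x \ge -1$. The point $x^* = 0$ with $u = 0$ satisfies all four KKT conditions ($3(x^*)^2 = 0$), but it is not a local minimum since $x^3 < 0$ for feasible $x$ slightly below 0; the objective is nonconvex.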

Example 1: Quadratic Problems with Equality Constraints
Consider, for $Q \succeq 0$, the following quadratic programming problem:
    minimize   $\frac{1}{2} x^\top Q x + c^\top x$
    subject to $Ax = 0$
A convex problem with no inequality constraints. By KKT, $x$ is primal optimal iff
    $\begin{bmatrix} Q & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix} = \begin{bmatrix} -c \\ 0 \end{bmatrix}$
for some dual variable $u$. This linear equation system combines ST ($Qx + c + A^\top u = 0$) and PF ($Ax = 0$); CS and DF are vacuous.
Such systems often arise from using Newton's method to solve an equality-constrained problem $\min_x \{ f(x) : Ax = b \}$: replacing $f$ by its second-order Taylor expansion around the current iterate, $f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \frac{1}{2} \Delta x^\top \nabla^2 f(x) \Delta x$, yields a QP of exactly this form in the step $\Delta x$ (with the constraint $A \Delta x = 0$ keeping $x + \Delta x$ feasible).
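A minimal numerical sketch of this fact, assuming $Q \succ 0$ and $A$ with full row rank so that the KKT matrix is invertible (the data below are made up for illustration):

    import numpy as np

    # Equality-constrained QP:  minimize (1/2) x^T Q x + c^T x  subject to  A x = 0.
    Q = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # positive definite
    c = np.array([1.0, -1.0])
    A = np.array([[1.0, 1.0]])            # full row rank

    n, p = Q.shape[0], A.shape[0]

    # Assemble and solve the KKT system  [Q A^T; A 0][x; u] = [-c; 0]  (ST + PF).
    K = np.block([[Q, A.T], [A, np.zeros((p, p))]])
    rhs = np.concatenate([-c, np.zeros(p)])
    x, u = np.split(np.linalg.solve(K, rhs), [n])

    print("x* =", x, " u* =", u)
    print("ST residual:", Q @ x + c + A.T @ u)   # ~ 0
    print("PF residual:", A @ x)                 # ~ 0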

Example 2: Support Vector Machine
Given labels $y \in \{-1, 1\}^m$ and feature vectors $x_1, \ldots, x_m$, let $X = [x_1, \ldots, x_m]^\top$. Recall from Lecture 1 the support vector machine problem:
    minimize over $w, b, \xi$:  $\frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \xi_i$
    subject to  $y_i (w^\top x_i + b) \ge 1 - \xi_i$,  $i = 1, \ldots, m$   (dual variables $u_i \ge 0$)
                $\xi_i \ge 0$,  $i = 1, \ldots, m$                         (dual variables $v_i \ge 0$)
A quadratic problem, and Slater's condition holds. Introducing the dual variables $u, v \ge 0$, the KKT system is:
    (ST): $0 = \sum_{i=1}^m u_i y_i$ (w.r.t. $b$),  $w = \sum_{i=1}^m u_i y_i x_i$ (w.r.t. $w$),  $u = C \mathbf{1} - v$ (w.r.t. $\xi$)
    (CS): $v_i \xi_i = 0$,  $u_i \big[ 1 - \xi_i - y_i (x_i^\top w + b) \big] = 0$,  $i = 1, \ldots, m$

Example 2: Support Vector Machine
Hence, at optimality we have $w = \sum_{i=1}^m u_i y_i x_i$, and $u_i$ is nonzero only if $y_i (x_i^\top w + b) = 1 - \xi_i$. Such points are called the support points.
- For support point $i$: if $\xi_i = 0$, then $x_i$ lies on the edge of the margin, and $u_i \in (0, C]$.
- For support point $i$: if $\xi_i \ne 0$, then $x_i$ lies on the wrong side of the margin, and $u_i = C$.
[Figure: separating hyperplane $w^\top x + b = 0$ with margin width $1/\|w\|$ on either side; numbered points mark the support points on and beyond the margin.]
The KKT conditions do not really give us a way to find the solution here, but they give better understanding & are useful in proofs. In fact, we can use this to screen away non-support points before performing the optimization (lower complexity).
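A short sketch of how these KKT relations are used in practice, assuming a dual solution u has already been computed by some QP solver (the function name and data layout are hypothetical, not from the lecture):

    import numpy as np

    def recover_primal_from_dual(X, y, u, C, tol=1e-8):
        """Use the SVM KKT conditions to recover (w, b) and the support points.

        X: (m, d) feature matrix, y: (m,) labels in {-1, +1}, u: (m,) dual variables.
        Assumes at least one "margin" support point with 0 < u_i < C exists.
        """
        # ST (w.r.t. w):  w = sum_i u_i y_i x_i
        w = X.T @ (u * y)

        # Support points are exactly those with u_i > 0.
        support = np.where(u > tol)[0]

        # For a margin support point (0 < u_i < C), CS gives xi_i = 0 and
        # y_i (x_i^T w + b) = 1, i.e., b = y_i - x_i^T w; average for stability.
        on_margin = np.where((u > tol) & (u < C - tol))[0]
        b = float(np.mean(y[on_margin] - X[on_margin] @ w))

        return w, b, support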

Example 3: Water-Filling
Example from [BV]: Consider the problem
    minimize   $-\sum_{i=1}^n \log(\alpha_i + x_i)$
    subject to $x \ge 0$, $\mathbf{1}^\top x = 1$
In information theory, $\log(\alpha_i + x_i)$ is the communication rate of the $i$th channel. Introducing the dual variables $u \ge 0$ (for $x \ge 0$) and $v$ unconstrained (for $\mathbf{1}^\top x = 1$), the KKT system is:
    (ST): $-1/(\alpha_i + x_i) - u_i + v = 0$,  $i = 1, \ldots, n$
    (CS): $u_i x_i = 0$,  $i = 1, \ldots, n$
    (PF): $x \ge 0$, $\mathbf{1}^\top x = 1$
    (DF): $u \ge 0$
Eliminating $u$ yields:
    $1/(\alpha_i + x_i) \le v$,   $x_i \big( v - 1/(\alpha_i + x_i) \big) = 0$,  $i = 1, \ldots, n$,   $x \ge 0$, $\mathbf{1}^\top x = 1$.

Example 3: Water-Filling
ST and CS imply that
    $x_i = \begin{cases} 1/v - \alpha_i & \text{if } v < 1/\alpha_i \\ 0 & \text{if } v \ge 1/\alpha_i \end{cases}$
    $\implies\ x_i = \max\{0,\, 1/v - \alpha_i\}$,  $i = 1, \ldots, n$.
Also, from PF, i.e., $\mathbf{1}^\top x = 1$, we have
    $\sum_{i=1}^n \max\{0,\, 1/v - \alpha_i\} = 1$,
a univariate equation, piecewise linear in $1/v$, and not hard to solve. This reduced problem is referred to as the water-filling solution (from [BV], p. 246).
[Figure: water-filling picture — the water level is $1/v$; each channel $i$ is a patch of ground height $\alpha_i$, and $x_i$ is the depth of water above patch $i$.]
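A small sketch of solving this reduced problem numerically by bisection on the water level $t = 1/v$ (the $\alpha_i$ values below are made up; the function name is illustrative):

    import numpy as np

    def water_filling(alpha, total=1.0, iters=100):
        """Solve sum_i max(0, t - alpha_i) = total for the water level t = 1/v
        by bisection, then return x_i = max(0, t - alpha_i)."""
        alpha = np.asarray(alpha, dtype=float)
        lo, hi = alpha.min(), alpha.min() + total   # fill(lo) = 0 <= total <= fill(hi)
        for _ in range(iters):
            t = 0.5 * (lo + hi)
            fill = np.sum(np.maximum(0.0, t - alpha))   # monotone increasing in t
            if fill < total:
                lo = t
            else:
                hi = t
        t = 0.5 * (lo + hi)
        return np.maximum(0.0, t - alpha)

    x = water_filling(alpha=[0.3, 0.5, 1.0])
    print(x, x.sum())   # allocations sum to ~1; smaller alpha_i receive more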

Next Class
Gradient Descent