
DUALITY

WHY DUALITY?
Unconstrained, differentiable $f(x)$: gradient descent, Newton's method, quasi-Newton, conjugate gradients, etc.
But what about non-differentiable $f(x)$? And constrained problems: minimize $f(x)$ subject to $g(x) \le 0$, $h(x) = 0$?

LAGRANGIAN
Simple case: minimize $f(x)$ subject to $Ax - b = 0$.
Saddle-point form: $\min_x \max_\lambda \; f(x) + \langle \lambda, Ax - b\rangle$.
The function $L(x,\lambda) = f(x) + \langle \lambda, Ax - b\rangle$ is the Lagrangian.

SADDLE-POINT FORM
Simple case: minimize $f(x)$ subject to $Ax - b = 0$, with saddle-point form $\min_x \max_\lambda f(x) + \langle \lambda, Ax - b\rangle$ (the Lagrangian inside).
The inner maximization enforces the constraint:
$\max_\lambda \; f(x) + \langle \lambda, Ax - b\rangle = \begin{cases} f(x), & Ax - b = 0 \\ \infty, & \text{otherwise} \end{cases}$
so $\min_x \max_\lambda \; f(x) + \langle \lambda, Ax - b\rangle = \min_{x:\, Ax = b} f(x)$.

INEQUALITY CONSTRAINTS
minimize $f(x)$ subject to $Ax - b = 0$, $Cx - d \le 0$
$\min_x \max_{\lambda,\, \mu \ge 0} \; f(x) + \langle \lambda, Ax - b\rangle + \langle \mu, Cx - d\rangle$
The multiplier $\mu$ carries a non-negativity constraint. Why does this work? When $Cx - d \le 0$, the maximizing $\mu$ is $0$, so the penalty term vanishes; when any component of $Cx - d$ is positive, the inner max is $+\infty$.

GENERAL FORM LAGRANGIAN
minimize $f(x)$ subject to $g(x) \le 0$, $h(x) = 0$
Lagrangian: $L(x, \lambda, \mu) = f(x) + \langle \lambda, h(x)\rangle + \langle \mu, g(x)\rangle$
Saddle-point formulation: $f(x^\star) = \min_x \max_{\lambda,\, \mu \ge 0} \; f(x) + \langle \lambda, h(x)\rangle + \langle \mu, g(x)\rangle$

CALCULUS INTERPRETATION
minimize $f(x)$ subject to $h(x) = 0$
At the solution, the gradient of the objective is parallel to the gradient of the constraint:
$\nabla f(x) = -\lambda \nabla h(x)$, i.e. $\nabla f(x) + \lambda \nabla h(x) = 0$
[Figure: contours of the objective and a linear constraint.]

CALCULUS INTERPRETATION
minimize $f(x)$ subject to $h(x) = 0$
Gradient of objective parallel to gradient of constraint: $\nabla f(x) + \lambda \nabla h(x) = 0$.
The Lagrangian recovers the same condition: with $L(x,\lambda) = f(x) + \langle \lambda, h(x)\rangle$, minimizing over $x$ gives the optimality condition $\partial_x L(x,\lambda) = \nabla f(x) + \lambda \nabla h(x) = 0$.

INEQUALITY CONDITIONS
minimize $f(x)$ subject to $g(x) \le 0$
$L(x,\mu) = f(x) + \langle \mu, g(x)\rangle$, with saddle point $\min_x \max_{\mu \ge 0} \; f(x) + \langle \mu, g(x)\rangle$
Inactive constraint: $\mu = 0$. Active constraint: $\mu > 0$.
[Figure: $x^\star$ in the interior of the feasible set vs. $x^\star$ on the boundary.]

INEQUALITY CONDITIONS
$\partial_x L(x,\mu) = \nabla f(x) + \mu \nabla g(x) = 0$
Inactive: $\mu = 0$. Active: $\mu > 0$.

OPTIMALITY CONDITIONS: KKT SYSTEM
minimize $f(x)$ subject to $g(x) \le 0$, $h(x) = 0$, with Lagrangian $L(x,\lambda,\mu) = f(x) + \langle \lambda, h(x)\rangle + \langle \mu, g(x)\rangle$.
Necessary conditions (Karush-Kuhn-Tucker):
$\nabla f(x^\star) + \nabla h(x^\star)^T \lambda + \nabla g(x^\star)^T \mu = 0$ (primal/dual optimality)
$g(x^\star) \le 0$, $h(x^\star) = 0$ (primal feasibility)
$\mu \ge 0$ (dual feasibility)
$\mu_i\, g_i(x^\star) = 0$ (complementary slackness)
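
The KKT conditions can be checked numerically on a problem whose solution is known in closed form. The sketch below is an illustrative example (not from the slides): projecting a point $c$ onto the non-negative orthant, i.e. minimize $\tfrac12\|x - c\|^2$ subject to $-x \le 0$, with $x^\star = \max(c, 0)$ and $\mu^\star = \max(-c, 0)$.

```python
import numpy as np

# minimize 0.5*||x - c||^2  subject to  -x <= 0   (projection onto x >= 0)
# Known solution: x* = max(c, 0), multiplier mu* = max(-c, 0).
rng = np.random.default_rng(0)
c = rng.standard_normal(5)

x_star = np.maximum(c, 0.0)
mu_star = np.maximum(-c, 0.0)

grad_L = (x_star - c) - mu_star           # stationarity: grad f + grad g^T mu, with grad g = -I
assert np.allclose(grad_L, 0.0)           # primal/dual optimality
assert np.all(-x_star <= 1e-12)           # primal feasibility: g(x*) <= 0
assert np.all(mu_star >= 0.0)             # dual feasibility
assert np.allclose(mu_star * x_star, 0)   # complementary slackness
print("KKT conditions hold:", x_star, mu_star)
```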

DUAL FUNCTION
$d(\lambda,\mu) = \min_x L(x,\lambda,\mu) = \min_x \; f(x) + \langle \lambda, h(x)\rangle + \langle \mu, g(x)\rangle$
The dual is concave. Why? (It is a pointwise minimum of functions that are affine in $(\lambda,\mu)$.) Does this depend on convexity of $f$?

DUAL FUNCTION
$d(\lambda,\mu) = \min_x L(x,\lambda,\mu) = \min_x \; f(x) + \langle \lambda, h(x)\rangle + \langle \mu, g(x)\rangle$
The dual is a lower bound on the optimal objective: $\min_x \; f(x) + \langle \lambda, h(x)\rangle + \langle \mu, g(x)\rangle \le f(x^\star)$ for any $\lambda$ and any $\mu \ge 0$. Why? (Because the optimal $x$ satisfies the constraints, so $\langle \lambda, h(x^\star)\rangle = 0$ and $\langle \mu, g(x^\star)\rangle \le 0$.)

GEOMETRIC INTERPRETATION OF LOWER BOUND
minimize $f(x)$ subject to $g(x) \le 0$, $h(x) = 0$
Write the constraints implicitly with indicator functions: minimize $f(x) + \chi_0(h(x)) + \chi_-(g(x))$, where
$\chi_-(z) = \begin{cases} 0, & z \le 0 \\ \infty, & \text{otherwise} \end{cases}$
and $\chi_0$ is the indicator of $\{0\}$.
[Figure: graph of the indicator function.]

GEOMETRIC INTERPRETATION
minimize $f(x)$ subject to $g(x) \le 0$, $h(x) = 0$, i.e. minimize $f(x) + \chi_0(h(x)) + \chi_-(g(x))$.
The indicator has a linear lower bound: $\chi_-(z) \ge \langle z, \mu\rangle$ for any $\mu \ge 0$ (and similarly $\chi_0(z) \ge \langle z, \lambda\rangle$ for any $\lambda$).
[Figure: indicator function and a linear lower bound.]

GEOMETRIC INTERPRETATION
minimize $f(x)$ subject to $g(x) \le 0$, $h(x) = 0$, i.e. minimize $f(x) + \chi_0(h(x)) + \chi_-(g(x))$.
Replacing the indicators by their linear lower bounds gives the lower bound $f(x) + \langle \lambda, h(x)\rangle + \langle \mu, g(x)\rangle$.

MAX OF DUAL
$\max_{\lambda,\,\mu \ge 0} d(\lambda,\mu) = \max_{\lambda,\,\mu \ge 0} \min_x \; f(x) + \langle \lambda, h(x)\rangle + \langle \mu, g(x)\rangle = \max_{\lambda,\,\mu \ge 0} \min_x L(x,\lambda,\mu)$
Question: does $\max_{\lambda,\,\mu \ge 0} \min_x L(x,\lambda,\mu) = \min_x \max_{\lambda,\,\mu \ge 0} L(x,\lambda,\mu)$? That is, does maximizing the dual equal minimizing the primal?

WEAK/STRONG DUALITY
Weak duality: $\max_{\lambda,\,\mu \ge 0} d(\lambda,\mu) \le f(x^\star)$. Always holds, even for non-convex problems.
Strong duality: $\max_{\lambda,\,\mu \ge 0} d(\lambda,\mu) = f(x^\star)$. Holds most of the time for convex problems.

SLATER'S CONDITION
minimize $f(x)$ subject to $g(x) \le 0$, $Ax = b$
Slater's condition (a constraint qualification) holds if there is a strictly feasible point: a point with $f(x) < \infty$, $g(x) < 0$, $Ax = b$.
[Figure: a feasible set that is not strictly feasible vs. one that is.]

SLATER'S CONDITION
minimize $f(x)$ subject to $g(x) \le 0$, $Ax = b$
Slater's condition holds if there is a strictly feasible point: $f(x) < \infty$, $g(x) < 0$, $Ax = b$.
Theorem: convexity + (linear equalities) + (Slater's condition) $\Rightarrow$ strong duality:
$\max_{\lambda,\,\mu \ge 0} \min_x L(x,\lambda,\mu) = \min_x \max_{\lambda,\,\mu \ge 0} L(x,\lambda,\mu)$, i.e. $\max_{\lambda,\,\mu \ge 0} d(\lambda,\mu) = f(x^\star)$.
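
A tiny worked example (not from the slides) makes both duality statements concrete: for minimize $x^2$ subject to $1 - x \le 0$, the dual is $d(\mu) = \min_x x^2 + \mu(1 - x) = \mu - \mu^2/4$. Every $\mu \ge 0$ gives a lower bound on $f(x^\star) = 1$ (weak duality), and since $x = 2$ is strictly feasible Slater's condition holds, so the maximum of the dual, attained at $\mu = 2$, equals $1$ (strong duality).

```python
import numpy as np

# Primal: minimize x^2 subject to 1 - x <= 0; solution x* = 1, f(x*) = 1.
# Dual:   d(mu) = min_x x^2 + mu*(1 - x) = mu - mu^2/4  (minimizer x = mu/2).
f_star = 1.0
mu = np.linspace(0.0, 6.0, 601)
d = mu - mu**2 / 4.0

assert np.all(d <= f_star + 1e-12)                # weak duality: every d(mu) is a lower bound
print("max of dual:", d.max(), "at mu =", mu[np.argmax(d)])   # strong duality: max d = 1 at mu = 2
```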

NON-HOMOGENEOUS SVM
Choose the line with the largest margin $2/\|w\|$, and push the data to the right side of the line:
$\min_{w,b} \; \tfrac12\|w\|^2 + C \sum_i h\!\left(l_i (d_i^T w - b)\right)$
where $h$ is the hinge loss.
[Figure: separating lines $d^T w - b = 1$ and $d^T w - b = 0$, with hinge penalties $h(l_i[d_i^T w - b])$ on the misclassified side.]

DUAL: THE HARD WAY
$\min_{w,b} \; \tfrac12\|w\|^2 + C\sum_i h\!\left(l_i(d_i^T w - b)\right) = \min_{w,b} \; \tfrac12\|w\|^2 + C\sum_i \max\{1 - l_i(d_i^T w - b),\, 0\} = \min_{w,b} \; \tfrac12\|w\|^2 + C\sum \max\{1 - L(Dw - b),\, 0\}$
(here $L = \mathrm{diag}(l)$ and the rows of $D$ are the data vectors $d_i^T$).
Idea: make this differentiable. Standard form:
$\min_{w,b,x} \; \tfrac12\|w\|^2 + C\langle 1, x\rangle$ subject to $x \ge 1 - L(Dw - b)$, $x \ge 0$

DUAL: THE HARD WAY
$\min_{w,b,x} \; \tfrac12\|w\|^2 + C\langle 1, x\rangle$ subject to $x \ge 1 - L(Dw - b)$, $x \ge 0$
Introduce positive multipliers $\alpha, \beta \ge 0$:
$d(\alpha,\beta) = \min_{w,b,x} \; \tfrac12\|w\|^2 + C\langle 1, x\rangle + \langle \alpha, 1 - L(Dw - b) - x\rangle + \langle \beta, -x\rangle$
This is too complicated; let's reduce it. The derivative with respect to $x$ gives $C\mathbf{1} - \alpha - \beta = 0$, with $\alpha, \beta \ge 0$. Solution: $\beta = C\mathbf{1} - \alpha$, which adds the new constraint $0 \le \alpha_i \le C$.

DUAL: THE HARD WAY
$\min_{w,b,x} \; \tfrac12\|w\|^2 + C\langle 1, x\rangle$ subject to $x \ge 1 - L(Dw - b)$, $x \ge 0$
Substituting $\beta = C\mathbf{1} - \alpha$ (with $0 \le \alpha_i \le C$), the terms $C\langle 1, x\rangle - \langle \alpha, x\rangle - \langle \beta, x\rangle$ cancel, so
$d(\alpha) = \min_{w,b} \; \tfrac12\|w\|^2 + \langle \alpha, 1 - L(Dw - b)\rangle = \min_{w,b} \; \tfrac12\|w\|^2 + \langle \alpha, 1\rangle - \langle \alpha, LDw\rangle + b\,\langle \alpha, l\rangle$

DUAL: THE HARD WAY
$d(\alpha) = \min_{w,b} \; \tfrac12\|w\|^2 + \langle \alpha, 1\rangle - \langle \alpha, LDw\rangle + b\,\langle \alpha, l\rangle$
Minimizing over $b$ forces $\langle \alpha, l\rangle = 0$. Minimizing over $w$: $w - D^T L\alpha = 0$, i.e. $w = D^T L\alpha$. Plugging back in:
$d(\alpha) = \tfrac12\|D^T L\alpha\|^2 + \langle \alpha, 1\rangle - \langle \alpha, L D D^T L\alpha\rangle = \tfrac12\|D^T L\alpha\|^2 + \langle \alpha, 1\rangle - \|D^T L\alpha\|^2 = -\tfrac12\|D^T L\alpha\|^2 + \langle \alpha, 1\rangle$

ADVANTAGES OF DUAL
Original problem: $\min_{w,b} \; \tfrac12\|w\|^2 + C\sum \max\{1 - L(Dw - b),\, 0\}$ — non-differentiable.
Dual problem: maximize $d(\alpha) = -\tfrac12\|D^T L\alpha\|^2 + \langle \alpha, 1\rangle$ subject to $0 \le \alpha \le C$, $\langle \alpha, l\rangle = 0$ — smooth.
Recover the primal solution from $w = D^T L\alpha$. The dual is solvable via coordinate descent (LIBSVM).
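
A quick sketch of this relationship using scikit-learn, whose SVC wraps LIBSVM: train a linear SVC and check that the primal weight vector it reports agrees with $w = D^T L\alpha$ assembled from the dual coefficients. The toy data and parameter choices are illustrative assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: D holds one data point per row, l holds labels in {-1, +1}.
rng = np.random.default_rng(0)
D = np.vstack([rng.normal(-1.0, 0.6, size=(40, 2)),
               rng.normal(+1.0, 0.6, size=(40, 2))])
l = np.hstack([-np.ones(40), np.ones(40)])

clf = SVC(kernel="linear", C=1.0).fit(D, l)     # LIBSVM solves the dual above

# dual_coef_ stores l_i * alpha_i for the support vectors, so
# w = D^T L alpha = sum over support vectors of (l_i alpha_i) d_i.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("w from dual matches sklearn's primal w:", np.allclose(w_from_dual, clf.coef_))
print("0 <= alpha_i <= C:", np.all(np.abs(clf.dual_coef_) <= 1.0 + 1e-9))
```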

ADVANTAGES OF DUAL
Dual problem: maximize $d(\alpha) = -\tfrac12\|D^T L\alpha\|^2 + \langle \alpha, 1\rangle$ subject to $0 \le \alpha \le C$, $\langle \alpha, l\rangle = 0$.
Kernel form: $d(\alpha) = -\tfrac12\, \alpha^T L\,(DD^T)\,L\alpha + \langle \alpha, 1\rangle$ subject to $0 \le \alpha \le C$, $\langle \alpha, l\rangle = 0$. The matrix $DD^T$ is the kernel.

KERNELS
$(DD^T)_{ij} = \langle d_i, d_j\rangle = \begin{cases} \text{big}, & \text{when } d_i \text{ and } d_j \text{ are close} \\ \text{small}, & \text{when } d_i \text{ and } d_j \text{ are far apart} \end{cases}$
Kernel function: $k(d_i, d_j) = \begin{cases} \text{big}, & \text{when } d_i \text{ and } d_j \text{ are similar} \\ \text{small}, & \text{when } d_i \text{ and } d_j \text{ are different} \end{cases}$
Dual SVM: we only need to know a similarity function. Kernel methods: replace the inner product with some other similarity.

EXAMPLE: TWO MOONS
Kernel function: $k(d_i, d_j)$ is big when $d_i$ and $d_j$ are similar, small when they are different.
[Figure: two-moons data, with a "close" pair and a "far" pair of points marked.]

EXAMPLE: TWO MOONS
Gaussian kernel: $k(x, y) = \exp\!\left(-\dfrac{\|x - y\|^2}{2\sigma^2}\right)$ — big when $x$ and $y$ are similar, small when they are different.
[Figure: two-moons data, with a "close" pair and a "far" pair of points marked.]

EXAMPLE: TWO MOONS
maximize $d(\alpha) = -\tfrac12\, \alpha^T L\,(DD^T)\,L\alpha + \langle \alpha, 1\rangle$ subject to $0 \le \alpha \le C$, $\langle \alpha, l\rangle = 0$
Replace the kernel $DD^T$ with $K_{ij} = k(d_i, d_j)$:
maximize $d(\alpha) = -\tfrac12\, \alpha^T L K L\alpha + \langle \alpha, 1\rangle$ subject to $0 \le \alpha \le C$, $\langle \alpha, l\rangle = 0$
Return to this later!!
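
A minimal sketch of this kernelized dual SVM in practice, using scikit-learn's SVC with a Gaussian (RBF) kernel on the two-moons data; the dataset parameters and the bandwidth $\sigma$ (passed through gamma) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two-moons data: not linearly separable in the original coordinates.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
# k(x, y) = exp(-||x - y||^2 / (2*sigma^2)) with sigma = 0.3, i.e. gamma = 1/(2*sigma^2).
rbf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * 0.3**2)).fit(X, y)

print("linear-kernel training accuracy:", linear.score(X, y))  # imperfect: the moons are not linearly separable
print("RBF-kernel training accuracy:   ", rbf.score(X, y))     # near 1.0: the kernel separates the moons
```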

CONJUGATE FUNCTION
$f^*(y) = \max_x \; y^T x - f(x)$
Is it convex? (Yes: it is a pointwise maximum of functions that are affine in $y$, regardless of $f$.)

CONJUGATE OF NORM
$f(x) = \|x\|$, with dual norm $\|y\|_* = \max_x \; y^T x / \|x\|$.
$f^*(y) = \max_x \; y^T x - \|x\|$
Hölder inequality: $y^T x \le \|y\|_* \|x\|$.
$\|y\|_* \le 1 \Rightarrow f^* = 0$ (why?); $\|y\|_* > 1 \Rightarrow f^* = \infty$ (why?). So
$f^*(y) = \begin{cases} 0, & \|y\|_* \le 1 \\ \infty, & \text{otherwise} \end{cases}$

EXAMPLES
$f(x) = \|x\|_2 \;\Rightarrow\; f^*(y) = \chi_2(y) = \begin{cases} 0, & \|y\|_2 \le 1 \\ \infty, & \|y\|_2 > 1 \end{cases}$
$f(x) = \|x\|_1 \;\Rightarrow\; f^*(y) = \chi_\infty(y) = \begin{cases} 0, & \|y\|_\infty \le 1 \\ \infty, & \|y\|_\infty > 1 \end{cases}$
$f(x) = \|x\|_\infty \;\Rightarrow\; f^*(y) = \chi_1(y) = \begin{cases} 0, & \|y\|_1 \le 1 \\ \infty, & \|y\|_1 > 1 \end{cases}$
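
A quick numerical illustration of the first row of this table (an example added here, not from the slides): by Cauchy-Schwarz, $y^T x - \|x\|_2 \le 0$ whenever $\|y\|_2 \le 1$, while searching along the ray $x = t\,y$ shows the value is unbounded once $\|y\|_2 > 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def conj_of_l2_along_ray(y, t_max=1e3):
    # Evaluate <y, t*y> - ||t*y||_2 = t*||y||^2 - t*||y|| on a grid of t >= 0.
    t = np.linspace(0.0, t_max, 10001)
    return (t * np.linalg.norm(y) ** 2 - t * np.linalg.norm(y)).max()

y_in = rng.standard_normal(5);  y_in *= 0.9 / np.linalg.norm(y_in)    # ||y||_2 = 0.9 < 1
y_out = rng.standard_normal(5); y_out *= 1.1 / np.linalg.norm(y_out)  # ||y||_2 = 1.1 > 1

print(conj_of_l2_along_ray(y_in))    # 0.0: the max is attained at x = 0, so f*(y) = 0
print(conj_of_l2_along_ray(y_out))   # grows with t_max: f*(y) = +infinity outside the unit ball
```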

HOW TO USE CONJUGATE
minimize $g(x) + f(Ax + b)$. Write this as minimize $g(x) + f(y)$ subject to $y = Ax + b$.
$d(\lambda) = \min_{x,y} \; g(x) + f(y) + \langle \lambda, Ax + b - y\rangle$
$\qquad = \min_{x,y} \; g(x) + \langle \lambda, Ax\rangle + f(y) - \langle \lambda, y\rangle + \langle \lambda, b\rangle$
$\qquad = \langle \lambda, b\rangle - \max_{x,y} \big[ -g(x) - \langle A^T\lambda, x\rangle - f(y) + \langle \lambda, y\rangle \big]$
$d(\lambda) = \langle \lambda, b\rangle - g^*(-A^T\lambda) - f^*(\lambda)$

EXAMPLE: LASSO
$\min_x \; \mu\|x\|_1 + \tfrac12\|Ax - b\|^2$ — how big should $\mu$ be?
This is $g(x) + f(Ax + b)$ with $g(x) = \mu\|x\|_1$, $f(z) = \tfrac12\|z\|^2$, and offset $-b$, so $d(\lambda) = \langle \lambda, -b\rangle - g^*(-A^T\lambda) - f^*(\lambda)$.
Note: $(\mu J)^*(y) = \mu\, J^*(y/\mu)$, so $g^*(y) = \chi_\infty(y/\mu)$, and $f^*(\lambda) = \tfrac12\|\lambda\|^2$:
maximize $-\langle \lambda, b\rangle - \chi_\infty(-A^T\lambda/\mu) - \tfrac12\|\lambda\|^2$
Change variables ($\lambda \to -\lambda$) to eliminate the negative sign:
maximize $\langle \lambda, b\rangle - \chi_\infty(A^T\lambda/\mu) - \tfrac12\|\lambda\|^2$

EXAMPLE: LASSO
$\min_x \; \mu\|x\|_1 + \tfrac12\|Ax - b\|^2$
maximize $\langle \lambda, b\rangle - \chi_\infty(A^T\lambda/\mu) - \tfrac12\|\lambda\|^2$
Complete the square:
maximize $-\tfrac12\|\lambda - b\|^2 + \tfrac12\|b\|^2$ subject to $\|A^T\lambda\|_\infty \le \mu$

EXAMPLE: LASSO
$\min_x \; \mu\|x\|_1 + \tfrac12\|Ax - b\|^2$
Dual problem: maximize $-\tfrac12\|\lambda - b\|^2 + \tfrac12\|b\|^2$ subject to $\|A^T\lambda\|_\infty \le \mu$.
Zero is a primal solution when the optimal objective is $\mu\|0\|_1 + \tfrac12\|A\cdot 0 - b\|^2 = \tfrac12\|b\|^2$. This coincides with the dual variable $\lambda = b$, which is feasible when $\mu \ge \|A^T b\|_\infty$.

EXAMPLE: LASSO
$\min_x \; \mu\|x\|_1 + \tfrac12\|Ax - b\|^2$
The solution is zero when $\mu \ge \|A^T b\|_\infty$. A reasonable starting guess: $\mu = \tfrac{1}{10}\|A^T b\|_\infty$.
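
A small sketch that checks this threshold numerically: solve the lasso with plain proximal gradient (ISTA) for $\mu$ just above and just below $\|A^T b\|_\infty$. The problem sizes and iteration count here are illustrative assumptions.

```python
import numpy as np

def lasso_ista(A, b, mu, iters=2000):
    """Proximal gradient (ISTA) for  mu*||x||_1 + 0.5*||Ax - b||^2."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2           # step size 1/L with L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                   # gradient of the smooth part
        z = x - t * grad
        x = np.sign(z) * np.maximum(np.abs(z) - t * mu, 0.0)   # soft-thresholding: prox of mu*||.||_1
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
b = rng.standard_normal(50)
mu_max = np.linalg.norm(A.T @ b, np.inf)

print("nonzeros at mu = 1.1*mu_max:", np.count_nonzero(lasso_ista(A, b, 1.1 * mu_max)))  # expect 0
print("nonzeros at mu = 0.1*mu_max:", np.count_nonzero(lasso_ista(A, b, 0.1 * mu_max)))  # expect > 0
```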

EXAMPLE: SVM
$\min_{w,b} \; \tfrac12\|w\|^2 + h(LDw - lb)$
This has the form $g(x) + f(Ax)$ with $x = \begin{pmatrix} w \\ b \end{pmatrix}$, $A = [\,LD \;\; -l\,]$,
$f(z) = h(z) = C\sum_i \max\{1 - z_i,\, 0\}$, and $g(x) = g(w,b) = \tfrac12\|w\|^2$.

HOW TO USE CONJUGATE
$f(z) = h(z) = C\sum_i \max\{1 - z_i,\, 0\}$
$f^*(\alpha) = \max_z \; \langle \alpha, z\rangle - h(z) = \max_z \; \langle \alpha, z\rangle - C\sum_i \max\{0,\, 1 - z_i\}$
Working componentwise, this is $\max_{z_i} \min\{\alpha_i z_i,\; (\alpha_i + C) z_i - C\}$; the max is finite only when $\alpha_i \in [-C, 0]$, and it occurs where the two linear functions intersect (at $z_i = 1$), so
$f^*(\alpha) = \begin{cases} 1^T\alpha, & \alpha_i \in [-C, 0] \;\; \forall i \\ \infty, & \text{otherwise} \end{cases}$

HOW TO USE CONJUGATE
$g(x) = g(w,b) = \tfrac12\|w\|^2$
$g^*(y) = \max_x \; \langle y, x\rangle - \tfrac12\|w\|^2 = \max_{w,b} \; \langle y_w, w\rangle + \langle y_b, b\rangle - \tfrac12\|w\|^2 = \begin{cases} \tfrac12\|y_w\|^2, & y_b = 0 \\ \infty, & \text{otherwise} \end{cases}$

SVM DUAL
$g(x) + f(Ax)$, so $d(\alpha) = -g^*(-A^T\alpha) - f^*(\alpha)$, with $x = \begin{pmatrix} w \\ b \end{pmatrix}$, $A = [\,LD \;\; -l\,]$,
$g^*(y) = \begin{cases} \tfrac12\|y_w\|^2, & y_b = 0 \\ \infty, & \text{otherwise} \end{cases}$, $\qquad f^*(\alpha) = \begin{cases} \sum_i \alpha_i, & \alpha_i \in [-C, 0] \\ \infty, & \text{otherwise} \end{cases}$
maximize $d(\alpha) = -\tfrac12\|D^T L\alpha\|^2 - 1^T\alpha$ subject to $l^T\alpha = 0$, $-C \le \alpha \le 0$
Same as before, with $\alpha$ replaced by $-\alpha$.

EXAMPLE: TV
$\min_x \; \tfrac12\|x - f\|^2 + \mu\|\nabla x\|_1$
This is $g(x) + f(Ax)$ with $A = \nabla$:
$f(z) = \mu\|z\|_1 \;\Rightarrow\; f^*(y) = \chi_\infty(y/\mu)$, using $(\mu J)^*(y) = \mu\, J^*(y/\mu)$
$g(z) = \tfrac12\|z - f\|^2 \;\Rightarrow\; g^*(y) = \tfrac12\|y + f\|^2 - \tfrac12\|f\|^2$
Form the dual problem $d(\lambda) = -g^*(-A^T\lambda) - f^*(\lambda)$ (dropping the constant $\tfrac12\|f\|^2$):
maximize $-\tfrac12\|\nabla^T\lambda - f\|^2 - \chi_\infty(\lambda/\mu)$

EXAMPLE: TV
maximize $-\tfrac12\|\nabla^T\lambda - f\|^2 - \chi_\infty(\lambda/\mu)$
maximize $-\tfrac12\|\nabla^T\lambda - f\|^2$ subject to $\|\lambda\|_\infty \le \mu$
Smooth objective, simple constraint.
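
A minimal sketch of exploiting this "smooth objective, simple constraint" structure: projected gradient ascent on the dual for a 1D signal, recovering the denoised primal via $x = f - \nabla^T\lambda$ (which follows from minimizing the Lagrangian over $x$). The test signal, $\mu$, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

# Noisy piecewise-constant 1D signal f.
rng = np.random.default_rng(0)
n = 200
f = np.concatenate([np.zeros(n // 2), np.ones(n // 2)]) + 0.1 * rng.standard_normal(n)

# Forward-difference operator: (grad x)_i = x_{i+1} - x_i, and its transpose.
grad = lambda x: x[1:] - x[:-1]
div = lambda lam: np.concatenate([[-lam[0]], lam[:-1] - lam[1:], [lam[-1]]])  # grad^T lam

mu, tau = 0.5, 0.25                  # TV weight and step size (tau <= 1/||grad||^2 = 1/4)
lam = np.zeros(n - 1)
for _ in range(500):
    # Gradient of -0.5*||grad^T lam - f||^2 with respect to lam is grad(f - grad^T lam).
    lam = lam + tau * grad(f - div(lam))
    lam = np.clip(lam, -mu, mu)      # projection onto ||lam||_inf <= mu

x = f - div(lam)                     # primal solution: approximately piecewise-constant denoised signal
print("total variation of f:", np.abs(grad(f)).sum(), " of x:", np.abs(grad(x)).sum())
```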

NON-CONVEX PROBLEM: SPECTRAL CLUSTERING
Non-linearly separable classes.
[Figures: two moons; Swiss roll.]

SPECTRAL CLUSTERING
Non-linearly separable classes (two moons).
(Dis)similarity matrix $W_{ij} = e^{-\|d_i - d_j\|^2/\sigma^2}$, labels $x_i \in \{-1, 1\}$.

LABELING PROBLEM
minimize $x^T W x$ subject to $x_i^2 = 1$
Big $W_{ij}$: different labels. Small $W_{ij}$: same label.
Is this convex? Why? How hard is this problem?
Dual function: $d(\lambda) = \min_x \; x^T W x + \langle \lambda, x^2 - 1\rangle$
Is this convex? Why? How hard is this problem?

LABELING PROBLEM
Dual function: $d(\lambda) = \min_x \; x^T W x + \langle \lambda, x^2 - 1\rangle$
Since $\sum_i \lambda_i x_i^2 = x^T \mathrm{diag}(\lambda)\, x$,
$d(\lambda) = \min_x \; x^T (W + \mathrm{diag}(\lambda))\, x - \langle \lambda, 1\rangle = \begin{cases} -\langle \lambda, 1\rangle, & W + \mathrm{diag}(\lambda) \succeq 0 \\ -\infty, & \text{otherwise} \end{cases}$

LABELING PROBLEM
$d(\lambda) = \begin{cases} -\langle \lambda, 1\rangle, & W + \mathrm{diag}(\lambda) \succeq 0 \\ -\infty, & \text{otherwise} \end{cases}$
Pick the dual vector to be the smallest constant we can get away with: $\lambda_i = -\lambda_{\min}(W)$ for all $i$, giving $d(\lambda) = n\,\lambda_{\min}(W)$.
The corresponding $x$ is the smallest eigenvector: $x = \arg\min_x \; x^T(W - \lambda_{\min} I)\,x + n\,\lambda_{\min} = e_{\min}$.

LABELING PROBLEM
Approximate solution: $x = \arg\min_x \; x^T(W - \lambda_{\min} I)\,x + n\,\lambda_{\min} = e_{\min}$
Final step: integer rounding, $x^\star_{\text{approx}} = \mathrm{round}(e_{\min})$ (take the sign of each entry).

OVERALL STRATEGY
minimize $x^T W x$ subject to $x_i^2 = 1$
Convex relaxation (why is the result only approximate?):
Dual function $d(\lambda) = \min_x \; x^T W x + \langle \lambda, x^2 - 1\rangle$
Approximate solution $x = \arg\min_x \; x^T(W - \lambda_{\min} I)\,x + n\,\lambda_{\min} = e_{\min}$
Final step: integer rounding, $x^\star_{\text{approx}} = \mathrm{round}(e_{\min})$
Note: we could also define a SIMilarity matrix and use the largest eigenvalue.

DO CODE EXAMPLE: spectral clustering
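
A minimal sketch of the code example the slide calls for, run on the two-moons data. It follows the usual spectral-clustering recipe (Gaussian similarity matrix, graph Laplacian, extreme eigenvector, sign rounding); the use of the Laplacian rather than $W$ itself, the value of $\sigma$, and the dataset parameters are choices assumed here, not taken from the slides.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Gaussian similarity matrix W_ij = exp(-||d_i - d_j||^2 / sigma^2); sigma is hand-tuned.
sigma = 0.1
W = np.exp(-cdist(X, X, "sqeuclidean") / sigma**2)

# Unnormalized graph Laplacian; its second-smallest eigenvector gives the relaxed labels.
Lap = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(Lap)
fiedler = eigvecs[:, 1]

# Integer rounding: take the sign of each entry as the label in {-1, +1}.
labels = np.sign(fiedler)

# Compare against the true moons (up to a global label flip).
agreement = max(np.mean(labels == 2 * y_true - 1), np.mean(labels == 1 - 2 * y_true))
print("fraction of points labeled consistently with the true moons:", agreement)
```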

NEWTON'S METHOD
Smooth problem: minimize $f(x)$.
Quadratic approximation: $\tfrac12(x - x^k)^T H (x - x^k) + \langle x - x^k, g\rangle$, where $H$ and $g$ are the Hessian and gradient at $x^k$.
What about a constrained problem: minimize $f(x)$ subject to $Ax = b$?

CONSTRAINED NEWTON
Constrained problem: minimize $f(x)$ subject to $Ax = b$.
Minimize the quadratic model subject to the constraint:
$\min_x \; \tfrac12(x - x^k)^T H (x - x^k) + \langle x - x^k, g\rangle$ subject to $Ax = b$
$L(x,\lambda) = \tfrac12(x - x^k)^T H (x - x^k) + \langle x - x^k, g\rangle + \langle \lambda, Ax - b\rangle$
KKT conditions: $H(x - x^k) + g + A^T\lambda = 0$, $Ax = b$.

CONSTRAINED NEWTON
KKT conditions: $H(x - x^k) + g + A^T\lambda = 0$, $Ax = b$.
With the Newton direction $d = x - x^k$, this is the linear system
$\begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix}\begin{pmatrix} d \\ \lambda \end{pmatrix} = \begin{pmatrix} -g \\ b - Ax^k \end{pmatrix}$
The objective block is only approximate (it uses the quadratic model), but the solution satisfies the constraints.

NEWTON ALGORITHM
1. Compute the Hessian $H$ and gradient $g$.
2. Solve the Newton system $\begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix}\begin{pmatrix} d \\ \lambda \end{pmatrix} = \begin{pmatrix} -g \\ b - Ax^k \end{pmatrix}$.
3. Armijo search: find a stepsize $\alpha$ with $f(x^k + \alpha d) \le f(x^k) + \tfrac{\alpha}{10}\, g^T d$.
4. Update the iterate: $x^{k+1} = x^k + \alpha d$.
When the stepsize is 1, the constraints are satisfied exactly.
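
A minimal sketch of this algorithm on a small equality-constrained problem; the objective, the feasible starting point, and the tolerances are illustrative assumptions.

```python
import numpy as np

# minimize f(x) = sum_i c_i * exp(x_i)  subject to  sum(x) = 1
c = np.array([1.0, 2.0, 3.0])
A = np.ones((1, 3))
b = np.array([1.0])

f = lambda x: np.sum(c * np.exp(x))
grad = lambda x: c * np.exp(x)
hess = lambda x: np.diag(c * np.exp(x))

x = np.full(3, 1.0 / 3.0)             # feasible start, so A d = 0 and d is a descent direction
for k in range(20):
    g, H = grad(x), hess(x)
    # Solve the KKT (Newton) system  [H A^T; A 0] [d; lam] = [-g; b - A x].
    KKT = np.block([[H, A.T], [A, np.zeros((1, 1))]])
    rhs = np.concatenate([-g, b - A @ x])
    d = np.linalg.solve(KKT, rhs)[:3]
    if np.linalg.norm(d) < 1e-10:
        break
    # Armijo backtracking: f(x + a*d) <= f(x) + (a/10) g^T d.
    a = 1.0
    while f(x + a * d) > f(x) + 0.1 * a * (g @ d):
        a *= 0.5
    x = x + a * d

print("x =", x, " sum(x) =", x.sum(), " f(x) =", f(x))   # iterates stay feasible: Ax = b throughout
```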

NEWTON COMPLEXITY
minimize $f(x)$ subject to $Ax = b$
Theorem: Suppose the Hessian of $f$ is Lipschitz continuous. Then after finitely many constrained Newton steps, the unit stepsize ($\alpha = 1$) is an Armijo step, and
$\|x^{k+1} - x^\star\| \le c\,\|x^k - x^\star\|^2$.
Furthermore, the constraints are exactly satisfied.