
1 DUALITY

2 WHY DUALITY? For unconstrained problems $\min_x f(x)$ we have gradient descent, Newton's method, quasi-Newton, conjugate gradients, etc. But what if $f(x)$ is non-differentiable? And what about constrained problems: $\min_x f(x)$ subject to $g(x) \le 0$, $h(x) = 0$?

3 LAGRANGIAN Simple case: $\min_x f(x)$ subject to $Ax - b = 0$. Saddle-point form: $\min_x \max_\lambda\, f(x) + \langle \lambda, Ax - b \rangle$. The function $L(x, \lambda) = f(x) + \langle \lambda, Ax - b \rangle$ is the Lagrangian.

4 SADDLE-POINT FORM Simple case: $\min_x f(x)$ subject to $Ax - b = 0$. The inner maximization enforces the constraint: $\max_\lambda f(x) + \langle \lambda, Ax - b \rangle = \begin{cases} f(x), & Ax - b = 0 \\ \infty, & \text{otherwise,} \end{cases}$ and therefore $\min_x \max_\lambda f(x) + \langle \lambda, Ax - b \rangle = \min_{x:\,Ax=b} f(x)$.

5 INEQUALITY CONSTRAINTS $\min_x f(x)$ subject to $Ax - b = 0$, $Cx - d \le 0$. Saddle-point form: $\min_x \max_{\lambda,\, \mu \ge 0} f(x) + \langle \lambda, Ax - b \rangle + \langle \mu, Cx - d \rangle$, with a non-negativity constraint on $\mu$. Why does this work? When $Cx - d \le 0$ the maximizing $\mu$ is $0$ and the penalty vanishes; when any component of $Cx - d$ is positive, the max over $\mu \ge 0$ is $\infty$.

6 GENERAL FORM LAGRANGIAN $\min_x f(x)$ subject to $g(x) \le 0$, $h(x) = 0$. Lagrangian: $L(x, \lambda, \mu) = f(x) + \langle \lambda, h(x) \rangle + \langle \mu, g(x) \rangle$. Saddle-point formulation: $f(x^\star) = \min_x \max_{\lambda,\, \mu \ge 0} f(x) + \langle \lambda, h(x) \rangle + \langle \mu, g(x) \rangle$.
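As a sanity check (my own toy example, not from the slides), consider $\min_x x^2$ subject to $x - 1 = 0$; working the saddle point by hand:

```latex
% Toy example: minimize x^2 subject to x - 1 = 0.
\[
  L(x,\lambda) = x^2 + \lambda (x - 1), \qquad
  \partial_x L = 2x + \lambda = 0 \;\Rightarrow\; x = -\lambda/2 .
\]
% Dual function and its maximization:
\[
  d(\lambda) = \min_x L(x,\lambda) = -\tfrac{\lambda^2}{4} - \lambda, \qquad
  d'(\lambda) = -\tfrac{\lambda}{2} - 1 = 0 \;\Rightarrow\; \lambda^\star = -2 .
\]
% Recovering the primal solution: x^* = -lambda^*/2 = 1, and d(lambda^*) = 1 = f(x^*).
```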

7 CALCULUS INTERPRETATION $\min_x f(x)$ subject to $h(x) = 0$. At a solution, the gradient of the objective is parallel to the gradient of the constraint: $\nabla f(x) = -\lambda \nabla h(x)$, i.e. $\nabla f(x) + \lambda \nabla h(x) = 0$. (Figure: contours of the objective tangent to a linear constraint.)

8 CALCULUS INTERPRETATION $\min_x f(x)$ subject to $h(x) = 0$; gradient of objective parallel to gradient of constraint: $\nabla f(x) = -\lambda \nabla h(x)$, i.e. $\nabla f(x) + \lambda \nabla h(x) = 0$. The same condition falls out of the Lagrangian $L(x, \lambda) = f(x) + \langle \lambda, h(x) \rangle$: optimality of $\min_x$ requires $\nabla_x L(x, \lambda) = \nabla f(x) + \lambda \nabla h(x) = 0$.

9 INEQUALITY CONDITIONS $\min_x f(x)$ subject to $g(x) \le 0$. $L(x, \mu) = f(x) + \langle \mu, g(x) \rangle$, and $\min_x \max_{\mu \ge 0} f(x) + \langle \mu, g(x) \rangle$. Inactive constraint: $\mu = 0$. Active constraint: $\mu > 0$. (Figure: $x^\star$ in the interior of the feasible set vs. $x^\star$ on the constraint boundary.)

10 INEQUALITY $\nabla_x L(x, \mu) = \nabla f(x) + \mu \nabla g(x) = 0$. Inactive: $\mu = 0$. Active: $\mu > 0$.

11 OPTIMALITY CONDITIONS: KKT SYSTEM $\min_x f(x)$ subject to $g(x) \le 0$, $h(x) = 0$, with $L(x, \lambda, \mu) = f(x) + \langle \lambda, h(x) \rangle + \langle \mu, g(x) \rangle$. The necessary conditions (Karush-Kuhn-Tucker): $\nabla f(x^\star) + \lambda \nabla h(x^\star) + \mu \nabla g(x^\star) = 0$ (primal/dual optimality); $g(x^\star) \le 0$, $h(x^\star) = 0$ (primal feasibility); $\mu \ge 0$ (dual feasibility); $\mu_i\, g_i(x^\star) = 0$ (complementary slackness).
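A minimal numerical illustration (my own sketch, not from the slides): for $\min x_1^2 + x_2^2$ subject to $1 - x_1 - x_2 \le 0$, the solution is $x^\star = (1/2, 1/2)$ with multiplier $\mu = 1$, and each KKT condition can be checked directly:

```python
import numpy as np

# Toy problem: minimize f(x) = x1^2 + x2^2  subject to  g(x) = 1 - x1 - x2 <= 0.
# Known solution (by symmetry): x* = (1/2, 1/2) with multiplier mu = 1.
x_star = np.array([0.5, 0.5])
mu = 1.0

grad_f = 2 * x_star                 # gradient of the objective at x*
grad_g = np.array([-1.0, -1.0])     # gradient of g(x) = 1 - x1 - x2
g_val = 1 - x_star.sum()

print("stationarity:", grad_f + mu * grad_g)   # -> [0. 0.]
print("primal feasibility:", g_val <= 0)       # -> True (g = 0: active)
print("dual feasibility:", mu >= 0)            # -> True
print("complementary slackness:", mu * g_val)  # -> 0.0
```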

12 DUAL FUNCTION $d(\lambda, \mu) = \min_x L(x, \lambda, \mu) = \min_x f(x) + \langle \lambda, h(x) \rangle + \langle \mu, g(x) \rangle$. The dual is concave. Why? (For each fixed $x$, $L$ is affine in $(\lambda, \mu)$, and a pointwise minimum of affine functions is concave.) Does this depend on convexity of $f$? (No.)

13 DUAL FUNCTION $d(\lambda, \mu) = \min_x L(x, \lambda, \mu) = \min_x f(x) + \langle \lambda, h(x) \rangle + \langle \mu, g(x) \rangle$. For $\mu \ge 0$, the dual is a lower bound on the optimal objective: $\min_x f(x) + \langle \lambda, h(x) \rangle + \langle \mu, g(x) \rangle \le f(x^\star)$. Why? Because the optimal $x^\star$ satisfies the constraints, so $h(x^\star) = 0$ and $\langle \mu, g(x^\star) \rangle \le 0$.

14 GEOMETRIC INTERPRETATION OF LOWER BOUND $\min_x f(x)$ subject to $g(x) \le 0$, $h(x) = 0$. Make the constraints implicit via characteristic functions: $f(x) + \chi_0(h(x)) + \chi_-(g(x))$, where $\chi_-(z) = \begin{cases} 0, & z \le 0 \\ \infty, & \text{otherwise} \end{cases}$ and $\chi_0$ is $0$ at $z = 0$ and $\infty$ elsewhere. (Figure: graph of the characteristic function.)

15 GEOMETRIC INTERPRETATION $\min_x f(x)$ subject to $g(x) \le 0$, $h(x) = 0$. Implicit constraints: $f(x) + \chi_0(h(x)) + \chi_-(g(x))$. Each characteristic function has a linear lower bound: $\chi_-(z) \ge \langle z, \mu \rangle$ for any $\mu \ge 0$. (Figure: a linear function lying below the characteristic function.)

16 GEOMETRIC INTERPRETATION $\min_x f(x)$ subject to $g(x) \le 0$, $h(x) = 0$. Implicit constraints: $f(x) + \chi_0(h(x)) + \chi_-(g(x))$. Replacing the characteristic functions by their linear lower bounds gives the Lagrangian lower bound $f(x) + \langle \lambda, h(x) \rangle + \langle \mu, g(x) \rangle$.

17 MAX OF DUAL The best lower bound maximizes the dual: $\max_{\lambda,\, \mu \ge 0} d(\lambda, \mu) = \max_{\lambda,\, \mu \ge 0} \min_x L(x, \lambda, \mu)$. Compare with the primal saddle-point problem $\min_x \max_{\lambda,\, \mu \ge 0} L(x, \lambda, \mu)$. Does maximizing the dual equal minimizing the primal, i.e. does $\max \min = \min \max$?

18 WEAK/STRONG DUALITY Weak duality: $\max_{\lambda,\, \mu \ge 0} d(\lambda, \mu) \le f(x^\star)$. Always holds, even for non-convex problems. Strong duality: $\max_{\lambda,\, \mu \ge 0} d(\lambda, \mu) = f(x^\star)$. Holds most of the time for convex problems.
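A quick example (mine, not from the slides) of how loose weak duality can be without convexity:

```latex
% Non-convex example with an infinite duality gap:
%   minimize x^3 subject to 1 - x <= 0   (optimal value f(x*) = 1 at x* = 1).
\[
  d(\mu) = \min_{x \in \mathbb{R}} \; x^3 + \mu (1 - x) = -\infty
  \quad \text{for every } \mu \ge 0,
\]
% since x^3 - mu*x is unbounded below as x -> -infinity. Weak duality,
% sup_mu d(mu) = -infinity <= 1 = f(x*), holds trivially; strong duality fails.
```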

19 SLATER'S CONDITION $\min_x f(x)$ subject to $g(x) \le 0$, $Ax = b$. Slater's condition (a constraint qualification) holds if there is a strictly feasible point: $f(x) < \infty$, $g(x) < 0$, $Ax = b$. (Figure: a feasible set that is not strictly feasible vs. one that is.)

20 SLATER'S CONDITION $\min_x f(x)$ subject to $g(x) \le 0$, $Ax = b$. Slater's condition holds if there is a strictly feasible point: $f(x) < \infty$, $g(x) < 0$, $Ax = b$. Theorem: convex $+$ (linear equalities) $+$ (Slater's condition) $\Rightarrow$ strong duality: $\max_{\lambda,\, \mu \ge 0} \min_x L(x, \lambda, \mu) = \min_x \max_{\lambda,\, \mu \ge 0} L(x, \lambda, \mu)$, i.e. $\max_{\lambda,\, \mu \ge 0} d(\lambda, \mu) = f(x^\star)$.

21 NON-HOMOGENEOUS SVM Choose the line with the largest margin $2 / \lVert w \rVert$, and push the data to the right side of the line: $\min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + C \sum_i h\big(l_i (d_i^T w - b)\big)$, where $h$ is the hinge loss. (Figure: decision line $d^T w - b = 0$ and margin line $d^T w - b = 1$.)

22 DUAL: THE HARD WAY $\min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + C \sum_i h\big(l_i (d_i^T w - b)\big) = \min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + C \sum_i \max\{1 - l_i (d_i^T w - b),\, 0\}$. Idea: make this differentiable. In matrix form, with $L = \operatorname{diag}(l)$ and $D$ the matrix whose rows are the data points: $\min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + C \langle 1, \max\{1 - L(Dw - b),\, 0\} \rangle$. Standard form: $\min_{w,b,x} \tfrac{1}{2} \lVert w \rVert^2 + C \langle 1, x \rangle$ subject to $x \ge 1 - L(Dw - b)$, $x \ge 0$.

23 DUAL: THE HARD WAY $\min_{w,b,x} \tfrac{1}{2} \lVert w \rVert^2 + C \langle 1, x \rangle$ subject to $x \ge 1 - L(Dw - b)$, $x \ge 0$. Introduce positive multipliers $\alpha, \beta \ge 0$: $d(\alpha, \beta) = \min_{w,b,x} \tfrac{1}{2} \lVert w \rVert^2 + C \langle 1, x \rangle + \langle \alpha,\, 1 - L(Dw - b) - x \rangle + \langle \beta, -x \rangle$. This is too complicated; let's reduce it. Setting the derivative with respect to $x$ to zero: $C1 - \alpha - \beta = 0$ with $\alpha, \beta \ge 0$, so $\beta = C1 - \alpha$, with the new constraint $0 \le \alpha_i \le C$.

24 DUAL: THE HARD WAY Substituting $\beta = C1 - \alpha$ makes the $x$ terms cancel: $C\langle 1, x \rangle - \langle \alpha, x \rangle - \langle \beta, x \rangle = 0$, leaving $d(\alpha) = \min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + \langle \alpha,\, 1 - L(Dw - b) \rangle = \min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + \langle \alpha, 1 \rangle - \langle \alpha, LDw \rangle + \langle \alpha, l \rangle b$, for $0 \le \alpha_i \le C$.

25 DUAL: THE HARD WAY $d(\alpha) = \min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + \langle \alpha, 1 \rangle - \langle \alpha, LDw \rangle + \langle \alpha, l \rangle b$. The minimum over $b$ is $-\infty$ unless $\langle \alpha, l \rangle = 0$, which becomes a constraint. The minimum over $w$: $\nabla_w = w - D^T L \alpha = 0$, or $w = D^T L \alpha$. Substituting back: $d(\alpha) = \tfrac{1}{2} \lVert D^T L \alpha \rVert^2 + \langle \alpha, 1 \rangle - \lVert D^T L \alpha \rVert^2 = -\tfrac{1}{2} \lVert D^T L \alpha \rVert^2 + \langle \alpha, 1 \rangle$.

26 ADVANTAGES OF DUAL Original problem $\min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + C \langle 1, \max\{1 - L(Dw - b), 0\} \rangle$ is non-differentiable. Dual problem: maximize $d(\alpha) = -\tfrac{1}{2} \lVert D^T L \alpha \rVert^2 + \langle \alpha, 1 \rangle$ subject to $0 \le \alpha \le C$, $\langle \alpha, l \rangle = 0$ is smooth, and the primal solution is recovered as $w = D^T L \alpha$. Solvable via coordinate descent (LIBSVM); a simpler projected-gradient sketch follows below.
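The slides mention coordinate descent (LIBSVM); as a simpler illustration, here is a sketch of my own (made-up data, projected gradient ascent rather than coordinate descent) that maximizes this dual and recovers $w = D^T L \alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up linearly separable data: rows of D are points, l holds labels +/-1.
D = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
l = np.concatenate([np.ones(20), -np.ones(20)])
C = 1.0
L = np.diag(l)

# Dual: maximize d(a) = -0.5*||D^T L a||^2 + <a, 1>, s.t. 0 <= a <= C, <a, l> = 0.
Q = L @ D @ D.T @ L                    # so d(a) = -0.5 a^T Q a + 1^T a
alpha = np.zeros(len(l))
step = 1.0 / np.linalg.norm(Q, 2)      # safe step from the largest eigenvalue
for _ in range(5000):
    alpha += step * (1 - Q @ alpha)        # gradient ascent on the dual
    alpha -= l * (alpha @ l) / (l @ l)     # project onto the hyperplane <a,l> = 0
    alpha = np.clip(alpha, 0, C)           # project onto the box [0, C]
    # (alternating the two projections is only approximate; fine for a sketch)

w = D.T @ (L @ alpha)                  # recover primal weights w = D^T L alpha
print("w =", w)
```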

27 ADVANTAGES OF DUAL Dual problem: maximize $d(\alpha) = -\tfrac{1}{2} \lVert D^T L \alpha \rVert^2 + \langle \alpha, 1 \rangle$ subject to $0 \le \alpha \le C$, $\langle \alpha, l \rangle = 0$. Kernel form: $d(\alpha) = -\tfrac{1}{2} \alpha^T L D D^T L \alpha + \langle \alpha, 1 \rangle$ subject to the same constraints; the matrix $DD^T$ is the kernel.

28 KERNELS $(DD^T)_{ij} = \langle d_i, d_j \rangle$ is big when $d_i$ and $d_j$ are close, small when they are far apart. A kernel function $k(d_i, d_j)$ is big when $d_i$ and $d_j$ are similar, small when they are different. The dual SVM only needs to know a similarity function. Kernel methods: replace the inner product with some other similarity.

29 EXAMPLE: TWO MOONS A kernel function $k(d_i, d_j)$: big when $d_i$ and $d_j$ are similar, small when they are different. (Figure: two-moons data with "far" and "close" pairs marked.)

30 EXAMPLE: TWO MOONS A kernel function $k(d_i, d_j)$: big when $d_i$ and $d_j$ are similar, small when they are different. For example, the Gaussian kernel $k(x, y) = \exp(-\lVert x - y \rVert^2 / \sigma^2)$. (Figure: two-moons data with "far" and "close" pairs marked.)
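A small sketch (mine) of building the kernel matrix $K_{ij} = k(d_i, d_j)$ with this Gaussian kernel:

```python
import numpy as np

def gaussian_kernel_matrix(D, sigma=1.0):
    """K[i, j] = exp(-||d_i - d_j||^2 / sigma^2) for rows d_i of D."""
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2<a,b>.
    sq = (D ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * D @ D.T
    return np.exp(-np.maximum(dist2, 0) / sigma ** 2)  # clamp tiny negatives

# Example: three points on a line; nearby points get kernel values near 1.
K = gaussian_kernel_matrix(np.array([[0.0], [0.1], [3.0]]))
print(K.round(3))
```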

31 EXAMPLE: TWO MOONS Kernelized dual: with $K_{ij} = k(d_i, d_j)$, maximize $d(\alpha) = -\tfrac{1}{2} \alpha^T L K L \alpha + \langle \alpha, 1 \rangle$ subject to $0 \le \alpha \le C$, $\langle \alpha, l \rangle = 0$. Return to this later!

32 CONJUGATE FUNCTION $f^*(y) = \max_x\, y^T x - f(x)$. Is it convex? (Yes: it is a pointwise maximum of functions affine in $y$, regardless of $f$.)

33 CONJUGATE OF NORM $f(x) = \lVert x \rVert$: $f^*(y) = \max_x\, y^T x - \lVert x \rVert$. Define the dual norm $\lVert y \rVert_* = \max_x\, y^T x / \lVert x \rVert$, so the Hölder inequality $y^T x \le \lVert y \rVert_* \lVert x \rVert$ holds. If $\lVert y \rVert_* \le 1$ then $y^T x - \lVert x \rVert \le 0$ with equality at $x = 0$, so $f^* = 0$; if $\lVert y \rVert_* > 1$, scaling up a maximizing $x$ drives the value to $\infty$. Hence $f^*(y) = \begin{cases} 0, & \lVert y \rVert_* \le 1 \\ \infty, & \text{otherwise.} \end{cases}$

34 EXAMPLES $f(x) = \lVert x \rVert_2$: $f^*(y) = \chi_2(y) = \begin{cases} 0, & \lVert y \rVert_2 \le 1 \\ \infty, & \lVert y \rVert_2 > 1. \end{cases}$ $f(x) = |x|$: $f^*(y) = \begin{cases} 0, & |y| \le 1 \\ \infty, & |y| > 1. \end{cases}$ $f(x) = \lVert x \rVert_1$: $f^*(y) = \chi_\infty(y) = \begin{cases} 0, & \lVert y \rVert_\infty \le 1 \\ \infty, & \lVert y \rVert_\infty > 1. \end{cases}$
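As a numerical sanity check (mine, brute force on a finite grid), the conjugate of $f(x) = |x|$ should be roughly $0$ for $|y| \le 1$ and blow up for $|y| > 1$:

```python
import numpy as np

def conjugate_1d(f, y, grid):
    """Brute-force f*(y) = max_x (y*x - f(x)) over a finite grid."""
    return np.max(y * grid - f(grid))

grid = np.linspace(-100, 100, 200001)
for y in [0.0, 0.5, 0.99, 1.01, 2.0]:
    # For |y| <= 1 this is ~0; for |y| > 1 it grows with the grid radius
    # (it would be infinity on an unbounded domain).
    print(f"f*({y}) ~ {conjugate_1d(np.abs, y, grid):.2f}")
```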

35 HOW TO USE CONJUGATE $\min_x g(x) + f(Ax + b)$: write this as $\min_{x,y} g(x) + f(y)$ subject to $y = Ax + b$. Then $d(\lambda) = \min_{x,y} g(x) + f(y) + \langle \lambda, Ax + b - y \rangle = \langle \lambda, b \rangle + \min_x \big( g(x) + \langle A^T \lambda, x \rangle \big) + \min_y \big( f(y) - \langle \lambda, y \rangle \big) = \langle \lambda, b \rangle - \max_x \big( \langle -A^T \lambda, x \rangle - g(x) \big) - \max_y \big( \langle \lambda, y \rangle - f(y) \big)$, i.e. $d(\lambda) = \langle \lambda, b \rangle - g^*(-A^T \lambda) - f^*(\lambda)$.

36 EXAMPLE: LASSO $\min_x \mu \lVert x \rVert_1 + \tfrac{1}{2} \lVert Ax - b \rVert^2$ (how big should $\mu$ be?). In the form $g(x) + f(Ax + b)$, with $d(\lambda) = \langle \lambda, b \rangle - g^*(-A^T \lambda) - f^*(\lambda)$ and the scaling rule $(\mu J)^*(y) = \mu J^*(y / \mu)$, the dual is: maximize $\langle \lambda, b \rangle - \chi_\infty(A^T \lambda / \mu) - \tfrac{1}{2} \lVert \lambda \rVert^2$ (after a change of variables $\lambda \to -\lambda$ to eliminate the negative sign).

37 EXAMPLE: LASSO $\min_x \mu \lVert x \rVert_1 + \tfrac{1}{2} \lVert Ax - b \rVert^2$. Dual: maximize $\langle \lambda, b \rangle - \chi_\infty(A^T \lambda / \mu) - \tfrac{1}{2} \lVert \lambda \rVert^2$. Completing the square: maximize $-\tfrac{1}{2} \lVert \lambda - b \rVert^2 + \tfrac{1}{2} \lVert b \rVert^2$ subject to $\lVert A^T \lambda \rVert_\infty \le \mu$.

38 EXAMPLE: LASSO $\min_x \mu \lVert x \rVert_1 + \tfrac{1}{2} \lVert Ax - b \rVert^2$. Dual problem: maximize $-\tfrac{1}{2} \lVert \lambda - b \rVert^2 + \tfrac{1}{2} \lVert b \rVert^2$ subject to $\lVert A^T \lambda \rVert_\infty \le \mu$. Zero is a primal solution when the optimal objective is $\mu \lVert 0 \rVert_1 + \tfrac{1}{2} \lVert A \cdot 0 - b \rVert^2 = \tfrac{1}{2} \lVert b \rVert^2$. This coincides with the dual variable $\lambda = b$, which is feasible when $\mu \ge \lVert A^T b \rVert_\infty$.

39 EXAMPLE: LASSO $\min_x \mu \lVert x \rVert_1 + \tfrac{1}{2} \lVert Ax - b \rVert^2$. The solution is zero when $\mu \ge \lVert A^T b \rVert_\infty$, which suggests the heuristic $\mu_{\text{guess}} = \tfrac{1}{10} \lVert A^T b \rVert_\infty$.
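A quick numerical check of this threshold (my own sketch, using proximal gradient / ISTA on random data; ISTA is a standard LASSO solver, not something the slides introduce):

```python
import numpy as np

def ista(A, b, mu, iters=2000):
    """Proximal gradient (ISTA) for min_x mu*||x||_1 + 0.5*||Ax - b||^2."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * A.T @ (A @ x - b)     # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0)  # soft-threshold
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)
mu_max = np.abs(A.T @ b).max()               # the threshold ||A^T b||_inf

print(np.abs(ista(A, b, 1.01 * mu_max)).max())  # ~0: solution is exactly zero
print(np.abs(ista(A, b, 0.10 * mu_max)).max())  # > 0: solution is nonzero
```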

40 EXAMPLE: SVM In the form $\min_x g(x) + f(Ax)$: stack $x = \begin{pmatrix} w \\ b \end{pmatrix}$, take $A = [LD,\ -l]$, and set $f(z) = h(z) = C \sum_i \max\{1 - z_i, 0\}$ (the hinge), $g(x) = g(w, b) = \tfrac{1}{2} \lVert w \rVert^2$, so the problem is $\min_{w,b} \tfrac{1}{2} \lVert w \rVert^2 + h(LDw - lb)$.

41 HOW TO USE CONJUGATE $f(z) = h(z) = C \sum_i \max\{1 - z_i, 0\}$. $f^*(\beta) = \max_z \langle \beta, z \rangle - h(z) = \max_z \langle \beta, z \rangle - C \sum_i \max\{0, 1 - z_i\}$. Working componentwise, this is $\max_{z_i} \min\{\beta_i z_i,\ \beta_i z_i - C(1 - z_i)\} = \max_{z_i} \min\{\beta_i z_i,\ (\beta_i + C) z_i - C\}$; the max occurs where the two linear functions intersect (at $z_i = 1$), and is finite exactly when the slopes have opposite signs, i.e. $\beta_i \in [-C, 0]$. Hence $f^*(\beta) = \begin{cases} \langle 1, \beta \rangle, & \beta_i \in [-C, 0] \\ \infty, & \text{otherwise.} \end{cases}$

42 HOW TO USE CONJUGATE $g(x) = g(w, b) = \tfrac{1}{2} \lVert w \rVert^2$. With $y = (y_w, y_b)$: $g^*(y) = \max_{w,b} \langle y_w, w \rangle + \langle y_b, b \rangle - \tfrac{1}{2} \lVert w \rVert^2 = \begin{cases} \tfrac{1}{2} \lVert y_w \rVert^2, & y_b = 0 \\ \infty, & \text{otherwise} \end{cases}$ (the max over $b$ is unbounded unless $y_b = 0$).

43 SVM DUAL $\min_x g(x) + f(Ax)$ has dual $d(\beta) = -g^*(-A^T \beta) - f^*(\beta)$, with $x = \begin{pmatrix} w \\ b \end{pmatrix}$, $A = [LD,\ -l]$, $g^*(y) = \begin{cases} \tfrac{1}{2} \lVert y_w \rVert^2, & y_b = 0 \\ \infty, & \text{otherwise,} \end{cases}$ and $f^*(\beta) = \begin{cases} \sum_i \beta_i, & \beta \in [-C, 0] \\ \infty, & \text{otherwise.} \end{cases}$ This gives: maximize $-\tfrac{1}{2} \lVert D^T L \beta \rVert^2 - \langle 1, \beta \rangle$ subject to $l^T \beta = 0$, $-C \le \beta \le 0$. Same as before, with $\beta = -\alpha$.

44 EXAMPLE: TV $\min_x \tfrac{1}{2} \lVert x - f \rVert^2 + \mu |\nabla x|$, in the form $g(x) + \tilde f(Ax)$ with $A = \nabla$ (note the name clash: $f$ is the data here, so write $\tilde f$ for the generic function). $\tilde f(z) = \mu |z| \Rightarrow \tilde f^*(y) = \chi_\infty(y / \mu)$, using $(\mu J)^*(y) = \mu J^*(y / \mu)$; and $g(z) = \tfrac{1}{2} \lVert z - f \rVert^2 \Rightarrow g^*(y) = \tfrac{1}{2} \lVert y + f \rVert^2 - \tfrac{1}{2} \lVert f \rVert^2$. Form the dual problem $d(\lambda) = -g^*(-A^T \lambda) - \tilde f^*(\lambda)$: maximize $-\tfrac{1}{2} \lVert \nabla^T \lambda - f \rVert^2 - \chi_\infty(\lambda / \mu)$ (dropping the constant $\tfrac{1}{2} \lVert f \rVert^2$).

45 EXAMPLE: TV Maximize $-\tfrac{1}{2} \lVert \nabla^T \lambda - f \rVert^2 - \chi_\infty(\lambda / \mu)$, i.e. maximize $-\tfrac{1}{2} \lVert \nabla^T \lambda - f \rVert^2$ subject to $\lVert \lambda \rVert_\infty \le \mu$: a smooth objective with a simple constraint.
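A 1-D sketch of this (mine, not from the slides): projected gradient ascent on the dual, recovering the denoised signal as $x = f - \nabla^T \lambda$ (which follows from minimizing $\tfrac{1}{2}\lVert x - f \rVert^2 + \langle \lambda, \nabla x \rangle$ over $x$):

```python
import numpy as np

# 1-D total-variation denoising: min_x 0.5*||x - f||^2 + mu*|grad x|,
# solved through the dual: max -0.5*||grad^T lam - f||^2  s.t. |lam_i| <= mu.
n, mu = 200, 2.0
f = np.concatenate([np.zeros(100), np.ones(100)]) \
    + 0.1 * np.random.default_rng(2).normal(size=n)

def grad(x):       # forward differences (maps R^n to R^(n-1))
    return x[1:] - x[:-1]

def grad_T(lam):   # transpose of the difference operator
    out = np.zeros(n)
    out[:-1] -= lam
    out[1:] += lam
    return out

lam = np.zeros(n - 1)
step = 0.25                    # ||grad||^2 <= 4, so 1/4 is a safe step size
for _ in range(3000):
    lam += step * grad(f - grad_T(lam))  # ascent on -0.5*||grad^T lam - f||^2
    lam = np.clip(lam, -mu, mu)          # project onto the box |lam_i| <= mu

x = f - grad_T(lam)            # recover the primal (denoised) signal
print(x[:5].round(2), x[-5:].round(2))   # left half near 0, right half near 1
```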

46 NON-CONVEX PROBLEM: SPECTRAL CLUSTERING Non-linearly separable classes. (Figures: two moons, Swiss roll.)

47 SPECTRAL CLUSTERING Non-linearly separable classes (two moons). (Dis)similarity matrix $W_{ij} = e^{-\lVert d_i - d_j \rVert^2 / \sigma^2}$; labels $x_i \in \{-1, 1\}$.

48 LABELING PROBLEM $\min_x x^T W x$ subject to $x_i^2 = 1$. Big $W_{ij}$ pushes for different labels; small $W_{ij}$ allows the same label. Is this convex? Why? How hard is this problem? (The constraint set $\{x : x_i^2 = 1\}$ is discrete and non-convex, so this is a hard combinatorial problem.) Dual function: $d(\lambda) = \min_x x^T W x + \langle \lambda, x^2 - 1 \rangle$. Is this convex (concave in $\lambda$)? Why? How hard is this problem?

49 LABELING PROBLEM Dual function: $d(\lambda) = \min_x x^T W x + \langle \lambda, x^2 - 1 \rangle$. Since $\sum_i \lambda_i x_i^2 = x^T \operatorname{diag}(\lambda)\, x$, we get $d(\lambda) = \min_x x^T (W + \operatorname{diag}(\lambda))\, x - \langle \lambda, 1 \rangle = \begin{cases} -\langle \lambda, 1 \rangle, & W + \operatorname{diag}(\lambda) \succeq 0 \\ -\infty, & \text{otherwise.} \end{cases}$

50 LABELING PROBLEM Dual function: $d(\lambda) = \min_x x^T (W + \operatorname{diag}(\lambda))\, x - \langle \lambda, 1 \rangle = \begin{cases} -\langle \lambda, 1 \rangle, & W + \operatorname{diag}(\lambda) \succeq 0 \\ -\infty, & \text{otherwise.} \end{cases}$ Pick the dual vector to be the smallest constant we can get away with: $\lambda_i = -\lambda_{\min}(W)$ for all $i$, giving $d = n\, \lambda_{\min}(W)$. The minimizer $x = \arg\min_x x^T (W - \lambda_{\min} I)\, x + \langle \lambda_{\min}, 1 \rangle = e_{\min}$, the eigenvector of the smallest eigenvalue.

51 LABELING PROBLEM Approximate solution: $x = \arg\min_x x^T (W - \lambda_{\min} I)\, x + \langle \lambda_{\min}, 1 \rangle = e_{\min}$. Final step, integer rounding: $x^\star_{\text{approx}} = \operatorname{round}(e_{\min})$ (i.e. take signs to land in $\{-1, 1\}$).

52 OVERALL STRATEGY $\min_x x^T W x$ subject to $x_i^2 = 1$; convex relaxation (why approximate? because we relax the integer constraint). Dual function: $d(\lambda) = \min_x x^T W x + \langle \lambda, x^2 - 1 \rangle$. Approximate solution: $x = \arg\min_x x^T (W - \lambda_{\min} I)\, x + \langle \lambda_{\min}, 1 \rangle = e_{\min}$. Final step, integer rounding: $x^\star_{\text{approx}} = \operatorname{round}(e_{\min})$. Note: we could also define a SIMilarity matrix and use the largest eigenvalue.

53 DO CODE EXAMPLE: spectral clustering (a sketch follows below).
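A minimal sketch of the code example (my own, numpy only, synthetic two-moons data). It follows the slides' recipe of rounding an extreme eigenvector, but swaps in the standard practical variant: the eigenvector of the second-smallest eigenvalue of the graph Laplacian $L = \operatorname{diag}(W 1) - W$ built from the Gaussian similarity $W$:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic two-moons data (made up): two interleaved half-circles plus noise.
t = rng.uniform(0, np.pi, 100)
top = np.c_[np.cos(t), np.sin(t)]
bottom = np.c_[1 - np.cos(t), 0.5 - np.sin(t)]
D = np.vstack([top, bottom]) + 0.05 * rng.normal(size=(200, 2))

# Gaussian weight matrix from the slides: W_ij = exp(-||d_i - d_j||^2 / sigma^2).
sigma = 0.3
sq = (D ** 2).sum(axis=1)
W = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2 * D @ D.T, 0) / sigma ** 2)

# Practical variant of the "extreme eigenvector + rounding" recipe: use the
# graph Laplacian and round the eigenvector of its second-smallest eigenvalue
# (the smallest is ~0 with a constant eigenvector, which carries no labels).
Lap = np.diag(W.sum(axis=1)) - W
vals, vecs = np.linalg.eigh(Lap)     # eigenvalues in ascending order
labels = np.sign(vecs[:, 1])         # integer rounding to {-1, +1}

print(labels[:100].mean(), labels[100:].mean())  # ~ +/-1: moons separated
```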

54 NEWTON'S METHOD Smooth problem $\min_x f(x)$; quadratic approximation $\tfrac{1}{2} (x - x^k)^T H (x - x^k) + \langle x - x^k, g \rangle$, where $H$ and $g$ are the Hessian and gradient at $x^k$. What about a constrained problem: $\min_x f(x)$ subject to $Ax = b$?

55 CONSTRAINED NEWTON Constrained problem: $\min_x f(x)$ subject to $Ax = b$. Quadratic model: $\min_x \tfrac{1}{2} (x - x^k)^T H (x - x^k) + \langle x - x^k, g \rangle$ subject to $Ax = b$, with Lagrangian $L(x, \lambda) = \tfrac{1}{2} (x - x^k)^T H (x - x^k) + \langle x - x^k, g \rangle + \langle \lambda, Ax - b \rangle$. KKT conditions: $H(x - x^k) + g + A^T \lambda = 0$, $Ax = b$.

56 CONSTRAINED NEWTON KKT conditions: $H(x - x^k) + g + A^T \lambda = 0$, $Ax = b$. With the Newton direction $d = x - x^k$, this is the linear system $\begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} d \\ \lambda \end{pmatrix} = \begin{pmatrix} -g \\ b - Ax^k \end{pmatrix}$. The solution is approximate for $f$ (it minimizes the quadratic model), but it satisfies the constraints exactly.

57 NEWTON ALGORITHM Compute the Hessian and gradient; solve the Newton system $\begin{pmatrix} H & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} d \\ \lambda \end{pmatrix} = \begin{pmatrix} -g \\ b - Ax^k \end{pmatrix}$; Armijo search: find $\tau$ with $f(x^k + \tau d) \le f(x^k) + \tfrac{\tau}{10}\, g^T d$; update the iterate $x^{k+1} = x^k + \tau d$. When the stepsize $\tau = 1$, the constraints are satisfied exactly. (A sketch follows below.)
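A compact sketch (mine) of this algorithm on a made-up objective, starting from a feasible point so that the Newton direction is a descent direction and the Armijo backtracking terminates:

```python
import numpy as np

def constrained_newton(f, grad, hess, A, b, x0, iters=20):
    """Equality-constrained Newton for min f(x) s.t. Ax = b, with Armijo search."""
    x = x0.astype(float)
    m = A.shape[0]
    for _ in range(iters):
        g, H = grad(x), hess(x)
        # KKT system: [H A^T; A 0] [d; lam] = [-g; b - A x]
        K = np.block([[H, A.T], [A, np.zeros((m, m))]])
        rhs = np.concatenate([-g, b - A @ x])
        d = np.linalg.solve(K, rhs)[: x.size]
        tau = 1.0
        while f(x + tau * d) > f(x) + (tau / 10) * (g @ d):  # Armijo, alpha = 1/10
            tau /= 2
        x = x + tau * d
    return x

# Toy problem: min sum(exp(x_i)) + 0.5*||x||^2  subject to  x_1 + x_2 + x_3 = 1.
f = lambda x: np.exp(x).sum() + 0.5 * x @ x
grad = lambda x: np.exp(x) + x
hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)
A, b = np.ones((1, 3)), np.array([1.0])
x = constrained_newton(f, grad, hess, A, b, np.array([1.0, 0.0, 0.0]))
print(x, A @ x)   # symmetric solution x_i = 1/3, constraint satisfied
```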

58 NEWTON COMPLEXITY $\min_x f(x)$ subject to $Ax = b$. Theorem: Suppose the Hessian of $f$ is Lipschitz continuous. Then after finitely many constrained Newton steps the unit stepsize ($\tau = 1$) is an Armijo step, and $\lVert x^{k+1} - x^\star \rVert \le C \lVert x^k - x^\star \rVert^2$ (quadratic convergence). Furthermore, the constraints are exactly satisfied.

(Note to self: this can be two lectures! Still needed: examples, non-convex problems, applications for matrix factorization.)

