Lagrange Multipliers and the Kernel Trick
Nicholas Ruozzi, University of Texas at Dallas
Based roughly on the slides of David Sontag
General Optimization

A mathematical detour; we'll come back to SVMs soon!

$$\min_{x \in \mathbb{R}^n} f_0(x)$$
subject to: $f_i(x) \le 0,\ i = 1, \dots, m$ and $h_i(x) = 0,\ i = 1, \dots, p$

- $f_0$ is not necessarily convex
- The constraints do not need to be linear
Lagrangian

$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x)$$

- Incorporates the constraints into a new objective function
- $\lambda \ge 0$ and $\nu$ are vectors of Lagrange multipliers
- The Lagrange multipliers can be thought of as soft constraints
Duality

Construct a dual function by minimizing the Lagrangian over the primal variables:

$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)$$

$g(\lambda, \nu) = -\infty$ whenever the Lagrangian is not bounded from below for the fixed $\lambda$ and $\nu$
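As a quick numeric sanity check, the dual function can be computed on a toy problem (an assumption for illustration, not from the slides): minimize $f_0(x) = x^2$ subject to $f_1(x) = 1 - x \le 0$. Here $L(x, \lambda) = x^2 + \lambda(1 - x)$, minimized at $x = \lambda/2$, giving $g(\lambda) = \lambda - \lambda^2/4$.

```python
import numpy as np

# Toy problem (illustrative, not from the slides):
#   minimize f0(x) = x^2  subject to  f1(x) = 1 - x <= 0
def lagrangian(x, lam):
    return x**2 + lam * (1.0 - x)

def g(lam):
    # Dual function: minimize the Lagrangian over x (here, on a fine grid).
    xs = np.linspace(-10, 10, 200001)
    return lagrangian(xs, lam).min()

# Closed form: the infimum is attained at x = lam/2,
# so g(lam) = lam - lam**2 / 4.
for lam in [0.0, 1.0, 2.0, 3.0]:
    assert abs(g(lam) - (lam - lam**2 / 4)) < 1e-6
```

The grid minimization matches the closed-form dual, illustrating that $g$ is well defined (and concave) even though it is built from an unconstrained minimization.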
The Primal Problem

$$\min_{x \in \mathbb{R}^n} f_0(x)$$
subject to: $f_i(x) \le 0,\ i = 1, \dots, m$ and $h_i(x) = 0,\ i = 1, \dots, p$

Equivalently,
$$\inf_x \sup_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu)$$
The Dual Problem

$$\sup_{\lambda \ge 0,\, \nu} g(\lambda, \nu)$$

Equivalently,
$$\sup_{\lambda \ge 0,\, \nu} \inf_x L(x, \lambda, \nu)$$

The dual problem is always concave, even if the primal problem is not convex
Primal vs. Dual

$$\sup_{\lambda \ge 0,\, \nu} \inf_x L(x, \lambda, \nu) \le \inf_x \sup_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu)$$

Why?
- $g(\lambda, \nu) \le L(x, \lambda, \nu)$ for all $x$
- $L(x, \lambda, \nu) \le f_0(x)$ for any feasible $x$ and $\lambda \ge 0$, since each $\lambda_i f_i(x) \le 0$ and each $h_i(x) = 0$ ($x$ is feasible if it satisfies all of the constraints)
- Let $x^*$ be the optimal solution to the primal problem; then for any $\lambda \ge 0$,
$$g(\lambda, \nu) \le L(x^*, \lambda, \nu) \le f_0(x^*)$$
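Weak duality can be checked numerically on an assumed toy problem (minimize $x^2$ subject to $x \ge 1$, used here only for illustration). Its primal optimum is $x^* = 1$ with $f_0(x^*) = 1$, and its dual function is $g(\lambda) = \lambda - \lambda^2/4$:

```python
import numpy as np

# Illustrative toy problem: minimize x^2 subject to 1 - x <= 0.
# Primal optimum: x* = 1, f0(x*) = 1. Dual function: g(lam) = lam - lam^2/4.
primal_opt = 1.0
lams = np.linspace(0, 10, 1001)
duals = lams - lams**2 / 4

# Weak duality: every dual value lower-bounds the primal optimum.
assert np.all(duals <= primal_opt + 1e-12)

# In this example strong duality also holds: the bound is tight at lam = 2.
assert abs(duals.max() - primal_opt) < 1e-9
```

Every dual value is a certificate that the primal optimum cannot be smaller; maximizing over $\lambda$ finds the best such lower bound.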
Duality

Under certain conditions, the two optimization problems are equivalent:
$$\sup_{\lambda \ge 0,\, \nu} \inf_x L(x, \lambda, \nu) = \inf_x \sup_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu)$$

This is called strong duality

If the inequality is strict, then we say that there is a duality gap
- The size of the gap is measured by the difference between the two sides of the inequality
Slater's Condition

For any optimization problem of the form
$$\min_{x \in \mathbb{R}^n} f_0(x)$$
subject to: $f_i(x) \le 0,\ i = 1, \dots, m$ and $Ax = b$,
where $f_0, \dots, f_m$ are convex functions, strong duality holds if there exists an $x$ such that
$$f_i(x) < 0,\ i = 1, \dots, m \quad\text{and}\quad Ax = b$$
Dual SVM

$$\min_w \frac{1}{2} \|w\|^2$$
such that $y_i (w^T x^{(i)} + b) \ge 1$ for all $i$

Note that Slater's condition holds as long as the data is linearly separable
Dual SVM

$$L(w, b, \lambda) = \frac{1}{2} w^T w + \sum_i \lambda_i \left(1 - y_i (w^T x^{(i)} + b)\right)$$

Convex in $w$, so take derivatives to form the dual:
$$\frac{\partial L}{\partial w_k} = w_k - \sum_i \lambda_i y_i x_k^{(i)} = 0$$
$$\frac{\partial L}{\partial b} = -\sum_i \lambda_i y_i = 0$$
Dual SVM

$$L(w, b, \lambda) = \frac{1}{2} w^T w + \sum_i \lambda_i \left(1 - y_i (w^T x^{(i)} + b)\right)$$

Convex in $w$, so take derivatives to form the dual:
$$w = \sum_i \lambda_i y_i x^{(i)}, \qquad \sum_i \lambda_i y_i = 0$$
Dual SVM

$$\max_{\lambda \ge 0}\ -\frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j (x^{(i)})^T x^{(j)} + \sum_i \lambda_i$$
such that $\sum_i \lambda_i y_i = 0$

By strong duality, solving this problem is equivalent to solving the primal problem

Given the optimal $\lambda$, we can easily construct $w$ ($b$ can be found by complementary slackness)
Complementary Slackness

Suppose that there is zero duality gap. Let $x^*$ be an optimum of the primal and $(\lambda^*, \nu^*)$ be an optimum of the dual. Then

$$f_0(x^*) = g(\lambda^*, \nu^*) = \inf_x \left[ f_0(x) + \sum_{i=1}^m \lambda_i^* f_i(x) + \sum_{i=1}^p \nu_i^* h_i(x) \right]$$
$$\le f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{i=1}^p \nu_i^* h_i(x^*) \le f_0(x^*)$$
Complementary Slackness

This means that $\sum_{i=1}^m \lambda_i^* f_i(x^*) = 0$

As $\lambda_i^* \ge 0$ and $f_i(x^*) \le 0$, this can only happen if $\lambda_i^* f_i(x^*) = 0$ for all $i$

Put another way:
- If $f_i(x^*) < 0$ (i.e., the constraint is not tight), then $\lambda_i^* = 0$
- If $\lambda_i^* > 0$, then $f_i(x^*) = 0$

ONLY applies when there is no duality gap
Dual SVM

$$\max_{\lambda \ge 0}\ -\frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j (x^{(i)})^T x^{(j)} + \sum_i \lambda_i$$
such that $\sum_i \lambda_i y_i = 0$

By complementary slackness, $\lambda_i > 0$ means that $x^{(i)}$ is a support vector (we can then solve for $b$ using $w$)
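The whole pipeline (solve the dual, read off $w$ from stationarity, recover $b$ from a support vector) can be sketched numerically. This is a minimal sketch, assuming SciPy is available and using an invented toy dataset; it is a generic constrained-optimization solve, not a production SVM solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (an assumption for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T  # G_ij = y_i y_j x_i . x_j

def neg_dual(lam):
    # Negate the concave dual objective so we can minimize it.
    return 0.5 * lam @ G @ lam - lam.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0, None)] * len(y),                           # lambda_i >= 0
    constraints=[{"type": "eq", "fun": lambda l: l @ y}],  # sum_i lambda_i y_i = 0
)
lam = res.x

# Stationarity: w = sum_i lambda_i y_i x_i.
w = (lam * y) @ X

# Complementary slackness: any point with lambda_i > 0 lies on the margin,
# so y_i (w . x_i + b) = 1 there; average over support vectors for stability.
sv = lam > 1e-6
b = np.mean(y[sv] - X[sv] @ w)

print(np.round(w, 3), round(b, 3))
print(np.sign(X @ w + b) == y)  # should classify all training points correctly
```

For this data the maximum-margin hyperplane is $w = (0.5, 0.5)$, $b = -1$, and only the points on the margin carry nonzero multipliers.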
Dual SVM

$$\max_{\lambda \ge 0}\ -\frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j (x^{(i)})^T x^{(j)} + \sum_i \lambda_i$$
such that $\sum_i \lambda_i y_i = 0$

Takes $O(n^2)$ time just to evaluate the objective function, where $n$ is the number of data points

Active area of research to try to speed this up
The Kernel Trick

$$\max_{\lambda \ge 0}\ -\frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j (x^{(i)})^T x^{(j)} + \sum_i \lambda_i$$
such that $\sum_i \lambda_i y_i = 0$

The dual formulation only depends on inner products between the data points

The same thing is true if we use feature vectors instead
The Kernel Trick

For some feature vectors, we can compute the inner products quickly, even if the feature vectors are very large. This is best illustrated by example. Let

$$\varphi(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_2 x_1 \\ x_1^2 \\ x_2^2 \end{bmatrix}$$

$$\varphi(x_1, x_2) \cdot \varphi(z_1, z_2) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x_1 z_1 + x_2 z_2)^2 = (x \cdot z)^2$$

Reduces to a dot product in the original space
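The identity is easy to verify numerically. A quick sketch, using the four-component feature map $(x_1 x_2, x_2 x_1, x_1^2, x_2^2)$ from the example:

```python
import numpy as np

def phi(x):
    # Explicit feature map: (x1*x2, x2*x1, x1^2, x2^2).
    return np.array([x[0] * x[1], x[1] * x[0], x[0] ** 2, x[1] ** 2])

rng = np.random.default_rng(0)
for _ in range(100):
    x, z = rng.normal(size=2), rng.normal(size=2)
    # Inner product in feature space equals the squared dot product
    # in the original 2-D space: no need to ever build phi explicitly.
    assert abs(phi(x) @ phi(z) - (x @ z) ** 2) < 1e-8
```

Computing $(x \cdot z)^2$ costs the same as an ordinary dot product, regardless of how large the implicit feature space is.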
The Kernel Trick

The same idea can be applied for the feature vector $\varphi$ of all polynomials of degree (exactly) $d$:
$$\varphi(x) \cdot \varphi(z) = (x \cdot z)^d$$

More generally, a kernel is a function $k(x, z) = \varphi(x) \cdot \varphi(z)$ for some feature map $\varphi$

Rewrite the dual objective:
$$\max_{\lambda \ge 0,\ \sum_i \lambda_i y_i = 0}\ -\frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, k(x^{(i)}, x^{(j)}) + \sum_i \lambda_i$$
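The kernelized dual can be solved exactly like the linear one, replacing the Gram matrix of dot products with a kernel matrix. A sketch, assuming SciPy and an invented XOR-style dataset (not linearly separable in the original space, but separable with $k(x, z) = (x \cdot z)^2$):

```python
import numpy as np
from scipy.optimize import minimize

# XOR-style toy data (illustrative assumption, not from the slides).
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = (X @ X.T) ** 2                 # kernel matrix K_ij = (x_i . x_j)^2
G = (y[:, None] * y[None, :]) * K  # G_ij = y_i y_j k(x_i, x_j)

def neg_dual(lam):
    return 0.5 * lam @ G @ lam - lam.sum()

res = minimize(neg_dual, np.zeros(4), method="SLSQP",
               bounds=[(0, None)] * 4,
               constraints=[{"type": "eq", "fun": lambda l: l @ y}])
lam = res.x

# b from a support vector; the decision function needs only kernel
# evaluations, never the (possibly huge) explicit feature map.
sv = lam > 1e-6
b = np.mean(y[sv] - (lam * y) @ K[:, sv])

def predict(x):
    return np.sign((lam * y) @ ((X @ x) ** 2) + b)

print([predict(x) for x in X])  # should match y on all four points
```

Note that $w$ is never formed: both training and prediction go through $k$ alone, which is the whole point of the trick.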
Examples of Kernels

- Polynomial kernel of degree exactly $d$: $k(x, z) = (x \cdot z)^d$
- General polynomial kernel of degree $d$, for some $c$: $k(x, z) = (x \cdot z + c)^d$
- Gaussian kernel, for some $\sigma$: $k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$ (the corresponding $\varphi$ is infinite dimensional!)
- And many more
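All three kernels are one-liners. A minimal sketch (function names are mine, chosen for illustration):

```python
import numpy as np

def poly_exact(x, z, d):
    # Polynomial kernel of degree exactly d: (x . z)^d
    return (x @ z) ** d

def poly_general(x, z, d, c=1.0):
    # General polynomial kernel: (x . z + c)^d, mixing all degrees up to d
    return (x @ z + c) ** d

def gaussian(x, z, sigma=1.0):
    # Gaussian (RBF) kernel; its feature map is infinite dimensional
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_exact(x, z, 2))    # (1*3 + 2*(-1))^2 = 1^2 = 1
print(poly_general(x, z, 2))  # (1 + 1)^2 = 4
print(gaussian(x, x))         # k(x, x) = 1 for the Gaussian kernel
```

Each replaces the dot products in the dual objective without ever constructing the feature map.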
Kernels

A bigger feature space increases the possibility of overfitting
- Large margin solutions should still generalize reasonably well
- Alternative: add penalties to the objective to disincentivize complicated solutions

$$\min_w \frac{1}{2} \|w\|^2 + c \cdot (\text{\# of misclassifications})$$

Not a quadratic program anymore (in fact, it is NP-hard)

Same problem as with Hamming loss: there is no notion of how badly the data is misclassified