Lecture 3: Dual problems and Kernels
C4B Machine Learning, Hilary 2011, A. Zisserman

• Primal and dual forms
• Linear separability revisited
• Feature mapping
• Kernels for SVMs: the kernel trick, requirements, radial basis functions

SVM review

We have seen that for an SVM, learning a linear classifier f(x) = w^T x + b is formulated as solving an optimization problem over w:

$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i \max(0,\, 1 - y_i f(x_i))$$

This quadratic optimization problem is known as the primal problem. Instead, the SVM can be formulated to learn a linear classifier

$$f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b$$

by solving an optimization problem over α. This is known as the dual problem, and we will look at the advantages of this formulation.
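As a concrete illustration, the primal objective above is only a few lines of code. A minimal numpy sketch (the function name and data layout are our own, not from the lecture):

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)) for f(x) = w.x + b.

    X is an (N, d) data matrix, y an (N,) vector of labels in {-1, +1}.
    """
    f = X @ w + b                          # decision values f(x_i)
    hinge = np.maximum(0.0, 1.0 - y * f)   # per-point hinge loss
    return w @ w + C * hinge.sum()
```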
Sketch derivation of dual form

The Representer Theorem states that the solution w can always be written as a linear combination of the training data (proof: see example sheet):

$$w = \sum_{j=1}^{N} \alpha_j y_j x_j$$

Now, substitute for w in f(x) = w^T x + b:

$$f(x) = \Big( \sum_{j=1}^{N} \alpha_j y_j x_j \Big)^\top x + b = \sum_{j=1}^{N} \alpha_j y_j (x_j^\top x) + b$$

and for w in the cost function min_w ||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i:

$$\|w\|^2 = \Big( \sum_j \alpha_j y_j x_j \Big)^\top \Big( \sum_k \alpha_k y_k x_k \Big) = \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^\top x_k)$$

Hence, an equivalent optimization problem is over the α_j:

$$\min_{\alpha} \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^\top x_k) \quad \text{subject to} \quad y_i \Big( \sum_{j=1}^{N} \alpha_j y_j (x_j^\top x_i) + b \Big) \ge 1 \;\; \forall i$$

and a few more steps are required to complete the derivation.

Primal and dual formulations

N is the number of training points, and d is the dimension of the feature vector x.

Primal problem: for w ∈ R^d,

$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i \max(0,\, 1 - y_i f(x_i))$$

Dual problem: for α ∈ R^N (stated without proof),

$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^\top x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \;\; \forall i \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0$$

• Complexity of the solution is O(d^3) for the primal, and O(N^3) for the dual
• If N << d then it is more efficient to solve for α than for w
• The dual form only involves (x_j^T x_i). We will return to why this is an advantage when we look at kernels.
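The representer theorem can be checked numerically. A sketch assuming scikit-learn, whose fitted SVC exposes the products α_j y_j for the support vectors as dual_coef_:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = 2 * y - 1                            # relabel classes as {-1, +1}

clf = SVC(kernel="linear", C=10.0).fit(X, y)
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # sum_j alpha_j y_j x_j
print(np.allclose(w_from_dual, clf.coef_))            # True: recovers the primal w
```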
Primal and dual formulations

Primal version of classifier: f(x) = w^T x + b
Dual version of classifier: f(x) = Σ_i α_i y_i (x_i^T x) + b

At first sight the dual form appears to have the disadvantage of a K-NN classifier: it requires the training data points x_i. However, many of the α_i's are zero. The ones that are non-zero define the support vectors x_i.

[Figure: a linear SVM with decision boundary w^T x + b = 0 and margin planes w^T x + b = ±1; the dual classifier Σ_i α_i y_i (x_i^T x) + b depends only on the highlighted support vectors.]
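Continuing the sketch above, evaluating the dual classifier needs only the support vectors, not the whole training set (dual_decision is a hypothetical helper, not lecture code):

```python
def dual_decision(x_new, clf):
    # sum_i alpha_i y_i (x_i . x_new) + b, summed over the support vectors only
    return clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_

print(clf.support_vectors_.shape[0], "support vectors out of", X.shape[0], "points")
```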
Handling data that is not linearly separable

Introduce slack variables ξ_i ≥ 0:

$$\min_{w \in \mathbb{R}^d,\, \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i \;\; \text{for } i = 1 \ldots N$$

[Figure: a dataset for which a linear classifier is not appropriate.]

Solution 1: use polar coordinates

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} r \\ \theta \end{pmatrix}, \qquad \mathbb{R}^2 \to \mathbb{R}^2$$

The data is linearly separable in polar coordinates. The map acts non-linearly in the original space.
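As code, "Solution 1" is a one-line feature map per coordinate. A minimal numpy sketch (the function name is our own):

```python
import numpy as np

def polar_map(X):
    """Map 2-D points (x1, x2) to polar coordinates (r, theta)."""
    r = np.hypot(X[:, 0], X[:, 1])          # radius
    theta = np.arctan2(X[:, 1], X[:, 0])    # angle
    return np.column_stack([r, theta])
```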
Solution 2: map data to a higher dimension

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix}, \qquad \mathbb{R}^2 \to \mathbb{R}^3$$

[Figure: with axes X = x_1^2, Y = x_2^2, Z = √2 x_1 x_2, the data is linearly separable in 3D.]

This means that the problem can still be solved by a linear classifier.

SVM classifiers in a transformed feature space

$$\Phi : x \mapsto \Phi(x), \qquad \mathbb{R}^d \to \mathbb{R}^D$$

Learn a classifier linear in w for R^D: f(x) = w^T Φ(x) + b
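The explicit map is straightforward to implement. A sketch using a hypothetical concentric-rings dataset to stand in for the lecture's example:

```python
import numpy as np
from sklearn.datasets import make_circles

def phi(X):
    """(x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2), mapping R^2 to R^3."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2, np.sqrt(2) * x1 * x2])

X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)
Z = phi(X)   # in R^3 the two rings can be separated by a plane
```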
Primal classifier in transformed feature space

Classifier, with w ∈ R^D: f(x) = w^T Φ(x) + b

Learning, for w ∈ R^D:

$$\min_{w \in \mathbb{R}^D} \|w\|^2 + C \sum_i \max(0,\, 1 - y_i f(x_i))$$

• Simply map x to Φ(x), where the data is separable
• Solve for w in the high-dimensional space R^D
• Complexity of the solution is now O(D^3) rather than O(d^3)

Dual classifier in transformed feature space

Classifier: f(x) = Σ_i α_i y_i x_i^T x + b becomes

$$f(x) = \sum_i \alpha_i y_i \, \Phi(x_i)^\top \Phi(x) + b$$

Learning: max_α Σ_i α_i − ½ Σ_jk α_j α_k y_j y_k x_j^T x_k becomes

$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k \, \Phi(x_j)^\top \Phi(x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \;\; \forall i \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0$$
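Continuing the sketch above, the two routes can be compared directly. In scikit-learn the kernel (x^T z)^2 is a degree-2 polynomial kernel with gamma = 1 and coef0 = 0, so the primal (explicit map) and dual (kernel) routes solve the same problem:

```python
from sklearn.svm import SVC

primal = SVC(kernel="linear", C=10.0).fit(phi(X), y)    # learn w explicitly in R^3
dual = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=10.0).fit(X, y)
print(primal.score(X, y), dual.score(X, y))             # agree: same Gram matrix
```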
Dual classifier in transformed feature space

• Note that Φ(x) only occurs in pairs Φ(x_j)^T Φ(x_i)
• Once the scalar products are computed, the complexity is again O(N^3); it is not necessary to learn in the D-dimensional space, as it is for the primal
• Write k(x_j, x_i) = Φ(x_j)^T Φ(x_i). This is known as a kernel.

Classifier:

$$f(x) = \sum_i \alpha_i y_i \, k(x_i, x) + b$$

Learning:

$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k \, k(x_j, x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \;\; \forall i \;\; \text{and} \;\; \sum_i \alpha_i y_i = 0$$

Special transformations: the kernel trick

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix}, \qquad \mathbb{R}^2 \to \mathbb{R}^3$$

$$\Phi(x)^\top \Phi(z) = \left( x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2 \right) \begin{pmatrix} z_1^2 \\ z_2^2 \\ \sqrt{2}\, z_1 z_2 \end{pmatrix} = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = (x^\top z)^2$$

• The classifier can be learnt and applied without explicitly computing Φ(x)
• All that is required is the kernel k(x, z) = (x^T z)^2
• Complexity is still O(N^3)
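The identity Φ(x)^T Φ(z) = (x^T z)^2 is easy to verify numerically. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

phi = lambda v: np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])
print(np.isclose(phi(x) @ phi(z), (x @ z)**2))   # True, up to rounding error
```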
Example kernels

• Linear kernels: k(x, x') = x^T x'
• Polynomial kernels: k(x, x') = (1 + x^T x')^d for any d > 0. Contains all polynomial terms up to degree d.
• Gaussian kernels: k(x, x') = exp(−||x − x'||^2 / 2σ^2) for σ > 0. Infinite-dimensional feature space.

Valid kernels: when can the kernel trick be used?

Given some arbitrary function k(x_i, x_j), how do we know if it corresponds to a scalar product Φ(x_i)^T Φ(x_j) in some space?

Mercer kernels: k(·, ·) is a valid kernel if it satisfies:
• Symmetry: k(x_i, x_j) = k(x_j, x_i)
• Positive semi-definiteness: α^T K α ≥ 0 for all α ∈ R^N, where K is the N × N Gram matrix with entries K_ij = k(x_i, x_j)

e.g. k(x, z) = x^T z is a valid kernel, whereas k(x, z) = −x^T z is not (its Gram matrix fails positive semi-definiteness).
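On a finite sample the two Mercer conditions can be tested directly. A sketch assuming numpy (is_mercer is a hypothetical helper; passing on one sample is necessary but not sufficient for validity in general):

```python
import numpy as np

def is_mercer(K, tol=1e-10):
    """Test symmetry and positive semi-definiteness of a Gram matrix K."""
    symmetric = np.allclose(K, K.T)
    psd = np.linalg.eigvalsh((K + K.T) / 2).min() >= -tol
    return symmetric and psd

X = np.random.default_rng(1).normal(size=(30, 2))
K = X @ X.T                          # Gram matrix of the linear kernel
print(is_mercer(K), is_mercer(-K))   # True False: -x^T z is not a valid kernel
```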
SVM classifier with Gaussian kernel

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x) + b$$

where N is the size of the training data, α_i is a weight (which may be zero), and the x_i with non-zero weights are the support vectors.

Gaussian kernel:

$$k(x, x') = \exp\left( -\|x - x'\|^2 / 2\sigma^2 \right)$$

Radial Basis Function (RBF) SVM:

$$f(x) = \sum_i \alpha_i y_i \exp\left( -\|x - x_i\|^2 / 2\sigma^2 \right) + b$$

RBF Kernel SVM Example

[Figure: a 2-D scatter plot (feature x against feature y); the data is not linearly separable in the original feature space.]
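A sketch of such a classifier, assuming scikit-learn and a hypothetical two-rings dataset. Note that sklearn parameterises the Gaussian kernel as exp(−gamma ||x − x'||^2), so gamma = 1/(2σ^2):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

sigma = 1.0
clf = SVC(kernel="rbf", C=10.0, gamma=1.0 / (2 * sigma**2)).fit(X, y)
print(clf.score(X, y))   # near 1.0: separable once the RBF kernel is applied
```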
[Figure: RBF kernel SVM decision regions for σ = 1.0 at two values of C. Decreasing C gives a wider (soft) margin.]
[Figure: RBF kernel SVM decision regions for σ = 1.0 as C is varied further.]
[Figure: RBF kernel SVM decision regions for σ = 0.25 and σ = 0.1. Decreasing σ moves the classifier towards a nearest-neighbour classifier.]
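Both effects can be reproduced in a few lines, continuing the RBF sketch above (the (σ, C) values here are illustrative, not necessarily the lecture's):

```python
for sigma, C in [(1.0, 100.0), (1.0, 10.0), (0.25, 100.0), (0.1, 100.0)]:
    clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma**2)).fit(X, y)
    print(f"sigma={sigma}, C={C}: {clf.n_support_.sum()} support vectors")
```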
Kernel block structure

[Figure: decision boundaries and the corresponding N × N Gram matrices, K_ij = k(x_i, x_j), for a linear kernel and an RBF kernel; positive, negative, support and margin vectors are marked. The RBF Gram matrix shows a clear block structure: the kernel measures similarity between the points.]

Kernel Trick - Summary

• Classifiers can be learnt for high-dimensional feature spaces without actually having to map the points into the high-dimensional space
• Data may be linearly separable in the high-dimensional space, but not linearly separable in the original feature space
• Kernels can be used for an SVM because of the scalar product in the dual form, but can also be used elsewhere; they are not tied to the SVM formalism
• Kernels also apply to objects that are not vectors, e.g. k(h, h') = Σ_k min(h_k, h'_k) for histograms with bins h_k, h'_k (see the sketch below)
• We will see other examples of kernels later, in regression and unsupervised learning
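The histogram kernel above can be passed to an SVM as a callable that returns the Gram matrix. A sketch with hypothetical random histograms and labels, purely to illustrate the API:

```python
import numpy as np
from sklearn.svm import SVC

def hist_intersection(H1, H2):
    """Gram matrix K[a, b] = sum_k min(H1[a, k], H2[b, k])."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)
H = rng.random((40, 8))
H /= H.sum(axis=1, keepdims=True)        # rows are normalised 8-bin histograms
y = 2 * rng.integers(0, 2, size=40) - 1  # random {-1, +1} labels for illustration

clf = SVC(kernel=hist_intersection).fit(H, y)   # custom kernel via a callable
```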
Background reading

• Bishop, chapters 6.2 and 7
• Hastie et al., chapter 12
• More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml