IAML: Support Vector Machines II
Nigel Goddard
School of Informatics
Semester 1

In SVM I we saw:
- The max margin trick
- Geometry of the margin and how to compute it
- Finding the max margin hyperplane using a constrained optimization problem
- Max margin = min norm

This Time:
- The SVM optimization problem
- Non-separable data
- The kernel trick

The SVM optimization problem

Last time: the max margin weights can be computed by solving a constrained optimization problem

  min ‖w‖²  s.t.  y_i (w · x_i + w_0) ≥ +1 for all i

Many algorithms have been proposed to solve this. One of the earliest efficient algorithms is called SMO [Platt, 1998]. This is outside the scope of the course, but it does explain the name of the SVM method in Weka.
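The slides contain no code, but as a rough illustration here is a minimal sketch (assuming NumPy and scikit-learn rather than Weka; the toy data is made up). scikit-learn's SVC wraps libsvm, which uses an SMO-style solver, and a very large C approximates the hard margin problem above.

```python
# Minimal sketch (assumes scikit-learn and NumPy): fit a linear max margin
# classifier on a made-up separable data set. SVC wraps libsvm, which solves
# the SVM optimization problem with an SMO-style algorithm.
import numpy as np
from sklearn.svm import SVC

# Toy 2-d separable data: class +1 in one corner, class -1 in the opposite one.
X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 1.0],
              [-2.0, -2.0], [-1.5, -2.5], [-3.0, -1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard margin problem (no slack allowed).
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w   =", clf.coef_[0])        # max margin weight vector
print("w_0 =", clf.intercept_[0])   # bias term
print("support vectors:\n", clf.support_vectors_)
```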
Finding the optimum

If you go through some advanced maths (Lagrange multipliers, etc.), it turns out that you can show something remarkable. The optimal parameters look like

  w = ∑_i α_i y_i x_i

Furthermore, the solution is sparse: the optimal hyperplane is determined by just a few examples. Call these support vectors.
- α_i = 0 for non-support patterns
- The optimization problem to find the α_i has no local minima (like logistic regression)

Why a solution of this form? If you move the points that are not on the marginal hyperplanes, the solution doesn't change - therefore those points don't matter.

[Figure: the margin, with the support vectors lying on the marginal hyperplanes]

Prediction on a new data point

  f(x) = sign((w · x) + w_0) = sign(∑_{i=1}^n α_i y_i (x_i · x) + w_0)
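To make the "solution of this form" concrete, here is a small sketch (again assuming scikit-learn, reusing the made-up toy data from the earlier sketch) that reconstructs w from the support vectors' dual coefficients and checks that the dual prediction rule agrees with the primal one.

```python
# Sketch: verify that the optimal weights are a weighted sum of support
# vectors and that the dual prediction rule matches the primal one.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 1.0],
              [-2.0, -2.0], [-1.5, -2.5], [-3.0, -1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_[0]))   # True: w = sum_i alpha_i y_i x_i

# Dual prediction: f(x) = sign(sum_i alpha_i y_i (x_i . x) + w_0)
x_new = np.array([0.5, 1.0])
f_dual = np.sign(clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new)
                 + clf.intercept_[0])
print(f_dual == clf.predict([x_new])[0])        # True
```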
Non-separable training sets

If the data set is not linearly separable, the optimization problem that we have given has no solution:

  min ‖w‖²  s.t.  y_i (w · x_i + w_0) ≥ +1 for all i

Why?

[Figure: a non-separable data set and the margin]

Solution: Don't require that we classify all points correctly. Allow the algorithm to choose to ignore some of the points. This is obviously dangerous (why not ignore all of them?) so we need to give it a penalty for doing so.

Slack

Solution: Add a slack variable ξ_i ≥ 0 for each training example. If the slack variable is high, we get to relax the constraint, but we pay a price.

The new optimization problem is to minimize

  ‖w‖² + C (∑_{i=1}^n ξ_i)^k

subject to the constraints

  w · x_i + w_0 ≥ +1 − ξ_i  for y_i = +1
  w · x_i + w_0 ≤ −1 + ξ_i  for y_i = −1

Usually we set k = 1. C is a trade-off parameter: a large C gives a large penalty to errors. The solution has the same form, but the support vectors now also include all points where ξ_i > 0. Why?

Think about ridge regression again

Our max margin + slack optimization problem is to minimize ‖w‖² + C (∑_{i=1}^n ξ_i)^k subject to the constraints above. This looks even more like ridge regression than the non-slack problem:
- C (∑_{i=1}^n ξ_i)^k measures how well we fit the data
- ‖w‖² penalizes weight vectors with a large norm
So C can be viewed as a regularization parameter, like λ in ridge regression or regularized logistic regression. You're allowed to make this trade-off even when the data set is separable!
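A rough sketch of how the trade-off parameter C behaves in practice (made-up data with one mislabelled "outlier", assuming scikit-learn): small C tolerates slack and keeps a wide margin, large C penalizes errors and narrows it.

```python
# Sketch: the slack penalty C controls the margin / misclassification trade-off.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two Gaussian blobs plus one class -1 "outlier" placed inside the +1 blob.
X = np.vstack([rng.normal(+2, 0.5, size=(20, 2)),
               rng.normal(-2, 0.5, size=(20, 2)),
               [[1.8, 1.8]]])
y = np.array([+1] * 20 + [-1] * 20 + [-1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # geometric margin width = 2 / ||w||
    print(f"C={C:6}: margin width = {margin:.3f}, "
          f"#support vectors = {len(clf.support_vectors_)}")
# Small C tolerates slack (wider margin, largely ignores the outlier);
# large C penalizes slack heavily (narrower margin, tries harder to fit it).
```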
Why you might want slack in a separable data set

[Figure: a separable data set shown with and without slack]

Non-linear SVMs

SVMs can be made nonlinear just like any other linear algorithm we've seen (i.e., using a basis expansion). But in an SVM, the basis expansion is implemented in a very special way, using something called a kernel. The reason for this is that kernels can be faster to compute with if the expanded feature space is very high dimensional (even infinite)! This is a fairly advanced topic mathematically, so we will just go through a high-level version.

Kernel

A kernel is in some sense an alternate API for specifying to the classifier what your expanded feature space is. Up to now, we have always given the classifier a new set of training vectors φ(x_i) for all i, e.g., just as a list of numbers.

  φ : R^d → R^D

If D is large, this will be expensive; if D is infinite, this will be impossible.

Non-linear SVMs

Transform x to φ(x). The linear algorithm depends only on x · x_i, hence the transformed algorithm depends only on φ(x) · φ(x_i). Use a kernel function k(x_i, x_j) such that

  k(x_i, x_j) = φ(x_i) · φ(x_j)

(This is called the kernel trick, and can be used with a wide variety of learning algorithms, not just max margin.)
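A minimal sketch of the "alternate API" idea (assuming scikit-learn; the quadratic_kernel helper and the toy data are illustrative, not from the slides): rather than handing the classifier expanded vectors φ(x_i), we hand it only the matrix of kernel values.

```python
# Sketch: give the classifier only the Gram matrix K[i, j] = k(x_i, x_j),
# never the expanded feature vectors phi(x_i).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 2))
# Labels depend on the product of the coordinates: not linearly separable.
y_train = np.where(X_train[:, 0] * X_train[:, 1] > 0, +1, -1)

def quadratic_kernel(A, B):
    """k(a, b) = (a . b)^2, computed for all pairs of rows of A and B."""
    return (A @ B.T) ** 2

K_train = quadratic_kernel(X_train, X_train)      # n x n Gram matrix
clf = SVC(kernel="precomputed").fit(K_train, y_train)

X_test = rng.normal(size=(5, 2))
K_test = quadratic_kernel(X_test, X_train)        # kernel between test and training points
print(clf.predict(K_test))
```

Up to scaling, this is what SVC(kernel="poly", degree=2, gamma=1, coef0=0) computes internally; the point of the precomputed version is only to make the dependence on kernel values explicit.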
Example of kernel

Example 1: for a 2-d input space, take

  φ(x_i) = (x_{i,1}², √2 x_{i,1} x_{i,2}, x_{i,2}²)

then

  k(x_i, x_j) = φ(x_i) · φ(x_j) = (x_i · x_j)²

Kernels, dot products, and distance

The squared Euclidean distance between two vectors can be computed using dot products:

  d(x_1, x_2) = (x_1 − x_2)ᵀ(x_1 − x_2) = x_1ᵀ x_1 − 2 x_1ᵀ x_2 + x_2ᵀ x_2

Using a linear kernel k(x_1, x_2) = x_1ᵀ x_2 we can rewrite this as

  d(x_1, x_2) = k(x_1, x_1) − 2 k(x_1, x_2) + k(x_2, x_2)

Any kernel gives you an associated distance measure this way. Think of a kernel as an indirect way of specifying distances.

Support Vector Machine

A support vector machine is a kernelized maximum margin classifier. For max margin, remember that we had the magic property: classification uses f(x) = sgn(w · x + b), which kernelizes to f(x) = sgn(∑_i α_i y_i k(x, x_i) + b).

[Figure: SVM architecture - an input vector x is compared to the support vectors x_1 ... x_4 via the kernel k(x, x_i), e.g. k(x, x_i) = (x · x_i)^d, k(x, x_i) = exp(−‖x − x_i‖² / c), or k(x, x_i) = tanh(κ(x · x_i) + θ), and the results are combined with weights α_i y_i. Figure Credit: Bernhard Schölkopf]

Prediction on one example

This means we would predict the label of a test example as

  ŷ = sign[wᵀ x + w_0] = sign[∑_i α_i y_i x_iᵀ x + w_0]

Kernelizing this we get

  ŷ = sign[∑_i α_i y_i k(x_i, x) + w_0]
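The identity in Example 1 and the kernel-induced distance are easy to check numerically; a short sketch (NumPy only, with the explicit feature map written out as a helper phi for illustration):

```python
# Sketch: check that the quadratic kernel equals a dot product in the
# expanded feature space, and compute the kernel-induced distance.
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-d inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(a, b):
    """Quadratic kernel k(a, b) = (a . b)^2."""
    return (a @ b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Kernel value equals the dot product in the expanded feature space.
print(np.isclose(k(a, b), phi(a) @ phi(b)))                   # True

# Kernel-induced squared distance d(a, b) = k(a,a) - 2 k(a,b) + k(b,b)
# equals the squared Euclidean distance between phi(a) and phi(b).
d_kernel = k(a, a) - 2 * k(a, b) + k(b, b)
print(np.isclose(d_kernel, np.sum((phi(a) - phi(b)) ** 2)))   # True
```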
Example 2

  k(x_i, x_j) = exp(−‖x_i − x_j‖² / α²)

In this case the dimension of φ is infinite: it can be shown that no φ that maps into a finite-dimensional space will give you this kernel. We can never calculate φ(x), but the algorithm only needs us to calculate k for different pairs of points, e.g. to test a new input x:

  f(x) = sgn(∑_i α_i y_i k(x_i, x) + w_0)

[Figure: data mapped from input space to feature space via φ. Figure Credit: Bernhard Schölkopf]

Choosing φ, C

There are theoretical results, but we will not cover them. (If you want to look them up, there are actually upper bounds on the generalization error: look for VC-dimension and structural risk minimization.) However, in practice cross-validation methods are commonly used.

Example application

US Postal Service digit data (7291 examples, 16 × 16 images). Three SVMs using polynomial, RBF and MLP-type kernels were used (see Schölkopf and Smola, Learning with Kernels, 2002 for details):
- They use almost the same (≈ 90% overlapping) small sets (4% of the data base) of SVs
- All systems perform well (≈ 4% error)
Many other applications, e.g. text categorization, face detection, DNA analysis.

Comparison with linear and logistic regression

The underlying basic idea of linear prediction is the same, but the error functions differ:
- Logistic regression (non-sparse) vs SVM (hinge loss, sparse solution)
- Linear regression (squared error) vs ɛ-insensitive error
Linear regression and logistic regression can be kernelized too.
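A minimal sketch of the cross-validation approach to choosing the kernel parameter and C (assuming scikit-learn; it uses the small digits set bundled with scikit-learn, not the USPS data from the slide):

```python
# Sketch: choose C and the RBF kernel width by cross-validated grid search.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# scikit-learn's gamma plays the role of 1/alpha^2 in the RBF kernel above.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: %.3f" % search.best_score_)
```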
SVM summary

SVMs are the combination of max margin and the kernel trick:
- Learn linear decision boundaries (like logistic regression, perceptrons)
- Pick the hyperplane that maximizes the margin
- Use slack variables to deal with non-separable data
- The optimal hyperplane can be written in terms of support patterns
- Transform to a higher-dimensional space using kernel functions
- Good empirical results on many problems
- Appears to avoid overfitting in high-dimensional spaces (cf. regularization)

Sorry for all the maths!