IAML: Support Vector Machines
Charles Sutton and Victor Lavrenko
School of Informatics
Semester 1
Outline
- Separating hyperplane with maximum margin
- Non-separable training data
- Expanding the input into a high-dimensional space
- Support vector regression
Reading: W & F sec 6.3 (maximum margin hyperplane, nonlinear class boundaries), SVM handout. SV regression not examinable.
Overview
- Support vector machines are one of the most effective and widely used classification algorithms.
- SVMs are the combination of two ideas:
  - Maximum margin classification
  - The kernel trick
- SVMs are a linear classifier, like logistic regression.
Recall: Dot Products
- x · w is the length of the projection of x onto w (if w is a unit vector).
- (If you do not remember this, see the supplementary maths notes on the course Web site.)
Separating Hyperplane
- For any linear classifier:
- Training instances (x_i, y_i), i = 1, ..., n, with y_i ∈ {−1, +1}
- Hyperplane: w · x + w_0 = 0
- w · x + w_0 > 0 on one side of the hyperplane, w · x + w_0 < 0 on the other
Maximum margin
- Let the perpendicular distance from the hyperplane to the nearest +1 class point be d_+.
- Similarly, for the nearest −1 class point, the perpendicular distance is d_−.
- The margin is defined as min(d_+, d_−).
- The support vector machine algorithm looks for the (w, w_0) that gives rise to the maximum margin.
- At the max-margin solution, it must be true that d_+ = d_−.
Illustration of the margin
[Figure: separating hyperplane with normal vector w and the margin marked]
How to compute the margin using dot products
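As a brief sketch of the idea behind this slide (the notation d(x) and the auxiliary point x' are introduced here for illustration, not taken from the slide): projecting onto the unit normal w/||w|| turns the distance computation into a dot product.

```latex
% Sketch: perpendicular distance from a point x to the hyperplane w.x + w_0 = 0.
% Let x' be any point on the hyperplane, so that w . x' + w_0 = 0.
% Projecting (x - x') onto the unit normal w / ||w|| gives the signed distance:
d(x) \;=\; \frac{w \cdot (x - x')}{\lVert w \rVert}
     \;=\; \frac{w \cdot x + w_0}{\lVert w \rVert}.
```

In particular, d_+ and d_− from the previous slide are |w · x_i + w_0| / ||w|| evaluated at the nearest positive and negative training points.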
Max-margin as an optimization problem
- Our goal will be to come up with a constrained optimization problem, because then we can use standard technology to solve it. (By "standard technology" I mean fancy versions of the algorithms we learned in the optimization lecture.)
- At a high level, what we want is an optimization problem that says: find the w with maximum margin, subject to the constraints that all of the training examples are classified correctly.
- You could try to do this naively, e.g., maximize d_+ + d_−, etc. Instead we're going to do something a bit more clever.
- The reason is that optimizers like to see smooth, convex objective functions and constraints. Linear is even better; non-differentiable is to be avoided if possible.
Our first clever trick
- Note that (w, w_0) and (c w, c w_0) define the same hyperplane. This is like saying a margin of 1000 mm > 1 m.
- Remove the rescaling freedom by demanding that min_i |w · x_i + w_0| = 1.
- With this normalization the nearest points on either side satisfy |w · x_i + w_0| = 1, so instead of maximizing min(d_+, d_−) we can maximize d_+ + d_−.
- Now we have three types of constraints on w:
  - w · x_i + w_0 ≥ 0 for y_i = +1
  - w · x_i + w_0 ≤ 0 for y_i = −1
  - min_i |w · x_i + w_0| = 1
- We can simplify these in one fell swoop. These three constraints are equivalent to the much simpler
    y_i (w · x_i + w_0) ≥ +1 for all i
  (see the check below).
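A quick check of that last equivalence (a worked step added here, not on the original slide):

```latex
% For y_i = +1, the correct-side constraint plus the normalization give
%     w . x_i + w_0 >= +1 ;
% for y_i = -1 they give
%     w . x_i + w_0 <= -1 .
% Multiplying each inequality by y_i (= +1 or -1 respectively) yields the single condition
y_i \,(w \cdot x_i + w_0) \;\ge\; 1 \qquad \text{for all } i.
```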
A second trick
- It turns out that the margin is 1/||w||.
- Proof: for two points x_+ and x_− on the two boundaries,
    w · x_+ + w_0 = +1
    w · x_− + w_0 = −1
  thus w · (x_+ − x_−) = 2 and (w/||w||) · (x_+ − x_−) = 2/||w||.
- Note that we have assumed that the constraints of the optimization problem are satisfied. (We don't care what the margin is if they aren't, since that won't be a solution.)
The SVM optimization problem
- Note that maximizing 2/||w|| is equivalent to minimizing ||w||².
- So the SVM weights are determined by solving the optimization problem:
    min_w ||w||²
    s.t. y_i (w · x_i + w_0) ≥ +1 for all i
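As a minimal, hedged sketch of what "solving" this means (the toy data, the use of SciPy's general-purpose SLSQP solver, and all variable names are illustrative assumptions; real SVM solvers use specialised quadratic-programming methods instead):

```python
# Toy sketch: solve  min ||w||^2  s.t.  y_i (w . x_i + w_0) >= 1
# with a general-purpose constrained optimizer (SciPy SLSQP).
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable data set (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):
    w = theta[:2]                    # theta packs (w, w_0)
    return w @ w                     # ||w||^2

def margin_constraints(theta):
    w, w0 = theta[:2], theta[2]
    return y * (X @ w + w0) - 1.0    # must be >= 0 for every i

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints={"type": "ineq", "fun": margin_constraints})
w, w0 = res.x[:2], res.x[2]
print("w =", w, " w_0 =", w0, " margin =", 1.0 / np.linalg.norm(w))
```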
Finding the optimum
- The optimal hyperplane can be computed from a quadratic programming problem using Lagrange multipliers:
    w = Σ_i α_i y_i x_i
- This uses fancy numerical techniques from the optimization literature.
- The optimal hyperplane is determined by just a few examples: call these support vectors. α_i = 0 for non-support patterns.
- The optimization problem has no local minima (like logistic regression).
- Prediction on a new data point x:
    f(x) = sgn((w · x) + w_0)
         = sgn( Σ_{i=1}^n α_i y_i (x_i · x) + w_0 )
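A hedged illustration of this expansion (scikit-learn is an assumption of mine, not part of the lecture; its dual_coef_ attribute stores the products α_i y_i for the support vectors):

```python
# Sketch: recover w = sum_i alpha_i y_i x_i from a fitted linear SVM and
# check that it gives the same prediction as the library.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()   # sum_i alpha_i y_i x_i
x_new = np.array([1.0, 0.5])
score = w @ x_new + clf.intercept_[0]                 # w . x + w_0
print(np.sign(score), clf.predict([x_new])[0])        # same sign / class
```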
Non-separable training sets
- If the data set is not linearly separable, the optimization problem above has no solution.
- Solution: add a slack variable ξ_i ≥ 0 for each training example.
- The new optimization problem is to minimize
    ||w||² + C (Σ_{i=1}^n ξ_i)^k
  subject to the constraints
    w · x_i + w_0 ≥ +1 − ξ_i for y_i = +1
    w · x_i + w_0 ≤ −1 + ξ_i for y_i = −1
- Usually set k = 1. C is a trade-off parameter, picked by hand (see below). Large C gives a large penalty to errors.
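A small, hedged sketch of the effect of C (scikit-learn and the synthetic blob data are my assumptions for illustration): a small C tolerates more slack, while a large C penalises violations heavily.

```python
# Sketch: fit soft-margin linear SVMs with different C on overlapping data
# and compare how many training points end up as support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:g}: {clf.support_.size} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```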
15 / 22! ~ w margin 9 / 18
Non-linear SVMs
- Transform x to φ(x).
- The linear algorithm depends only on x · x_i. Hence the transformed algorithm depends only on φ(x) · φ(x_i).
- Use a kernel function k(x_i, x_j) such that k(x_i, x_j) = φ(x_i) · φ(x_j).
- (This is called the kernel trick, and can be used with a wide variety of learning algorithms, not just max margin.)
- Example 1: for a 2-d input space,
    φ(x) = (x_1², √2 x_1 x_2, x_2²)ᵀ
  with k(x_i, x_j) = (x_i · x_j)².
  (A numerical check of this identity is sketched below.)
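A minimal numerical check of Example 1 (the specific points and the helper phi are illustrative assumptions): the dot product in feature space equals the squared dot product in input space.

```python
# Sketch: verify k(x_i, x_j) = phi(x_i) . phi(x_j) for the quadratic kernel.
import numpy as np

def phi(x):
    # Feature map for 2-d input: (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(xi) @ phi(xj))   # explicit feature-space dot product
print((xi @ xj) ** 2)      # kernel evaluation k(x_i, x_j) = (x_i . x_j)^2
```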
[Figure: mapping from input space to feature space. Credit: Bernhard Schölkopf]
- Example 2:
    k(x_i, x_j) = exp(−||x_i − x_j||² / α²)
- In this case the dimension of φ is infinite.
- To test a new input x:
    f(x) = sgn( Σ_{i=1}^n α_i y_i k(x_i, x) + w_0 )
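A hedged sketch of Example 2 in practice (scikit-learn and the two-moons toy data are my assumptions; sklearn writes the RBF kernel as exp(−γ ||x_i − x_j||²), so γ plays the role of 1/α²):

```python
# Sketch: non-linear SVM with an RBF kernel on data that is not linearly
# separable in the input space.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("number of support vectors:", clf.support_.size)
print("training accuracy:", clf.score(X, y))
```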
Prediction / Applications
[Figure (credit: Bernhard Schölkopf): a new example x (the input vector) is compared to the support vectors x_1, ..., x_4 via kernel evaluations k(x, x_i); these are combined with weights λ_1, ..., λ_4 to give the classification f(x) = sgn(Σ_i λ_i k(x, x_i) + b). Example kernels: k(x, x_i) = (x · x_i)^d, k(x, x_i) = exp(−||x − x_i||² / c), k(x, x_i) = tanh(κ(x · x_i) + θ).]
Choosing φ, C
- There are theoretical results.
- However, in practice, cross-validation methods are commonly used (a sketch follows below).
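A minimal sketch of that practice (scikit-learn, the RBF kernel, and the parameter grid are illustrative assumptions): pick C and the kernel width by cross-validated grid search.

```python
# Sketch: choose C and the RBF kernel width by 5-fold cross-validation.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"C": [0.1, 1, 10, 100],
                                  "gamma": [0.01, 0.1, 1, 10]},
                      cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```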
Example application
- US Postal Service digit data (7291 examples, 16 × 16 images).
- Three SVMs using polynomial, RBF and MLP-type kernels were used (see Schölkopf and Smola, Learning with Kernels, 2002 for details).
- They use almost the same (≈ 90% overlap) small sets (4% of the data base) of SVs.
- All systems perform well (≈ 4% error).
- Many other applications, e.g. text categorization, face detection, DNA analysis.
Comparison with linear and logistic regression
- The underlying basic idea of linear prediction is the same, but the error functions differ.
- Logistic regression (non-sparse) vs SVM ("hinge loss", sparse solution).
- Linear regression (squared error) vs ε-insensitive error.
- Linear regression and logistic regression can be kernelized too.
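A small sketch contrasting those error functions (the chosen margin values are arbitrary): the hinge loss is exactly zero once y · f(x) ≥ 1, which is what makes the SVM solution sparse, while the logistic loss is positive everywhere.

```python
# Sketch: hinge loss (SVM) vs logistic loss, as functions of the margin y*f(x).
import numpy as np

margins = np.linspace(-2.0, 3.0, 6)            # values of y_i * f(x_i)
hinge = np.maximum(0.0, 1.0 - margins)         # SVM "hinge" loss
logistic = np.log(1.0 + np.exp(-margins))      # logistic regression loss
for m, h, l in zip(margins, hinge, logistic):
    print(f"y*f(x) = {m:5.2f}   hinge = {h:5.2f}   logistic = {l:5.2f}")
```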
SVM summary
- SVMs are the combination of max-margin and the kernel trick.
- They learn linear decision boundaries (like logistic regression, perceptrons).
- Pick the hyperplane that maximizes the margin.
- Use slack variables to deal with non-separable data.
- The optimal hyperplane can be written in terms of support patterns.
- Transform to a higher-dimensional space using kernel functions.
- Good empirical results on many problems.
- Appears to avoid overfitting in high-dimensional spaces (cf. regularization).