Pattern Recognition 2014
Support Vector Machines
Ad Feelders
Universiteit Utrecht
Overview

1. Separable Case
2. Kernel Functions
3. Allowing Errors (Soft Margin)
4. SVMs in R
Linear classifier for two classes

Linear model
    y(x) = w^T φ(x) + b    (7.1)
with t_n ∈ {−1, +1}. Predict t_0 = +1 if y(x_0) ≥ 0 and t_0 = −1 otherwise.
The decision boundary is given by y(x) = 0.
This is a linear classifier in feature space φ(x).
Mapping φ

    y(x) = w^T φ(x) + b = 0
φ maps x into a higher-dimensional space where the data is linearly separable.
Data linearly separable

Assume the training data is linearly separable in feature space, so there is at least one choice of w, b such that:
1. y(x_n) > 0 for t_n = +1;
2. y(x_n) < 0 for t_n = −1;
that is, all training points are classified correctly. Putting 1. and 2. together:
    t_n y(x_n) > 0,    n = 1, ..., N
Maximum Margin

There may be many solutions that separate the classes exactly. Which one gives the smallest prediction error?
The SVM chooses the line with maximal margin, where the margin is the distance between the line and the closest data point.
Two-class training data
(figure)
Many Linear Separators
(figure)
Decision Boundary
(figure)
Maximize Margin
(figure)
Support Vectors
(figure)
Weight vector is orthogonal to the decision boundary

Consider two points x_A and x_B, both of which lie on the decision surface. Because y(x_A) = y(x_B) = 0, we have
    (w^T x_A + b) − (w^T x_B + b) = w^T (x_A − x_B) = 0
and so the vector w is orthogonal to the decision surface.
Distance of a point to a line
(Figure: a point x at signed distance r from the line y(x) = w^T x + b = 0, with weight vector w.)
Distance to decision surface (φ(x) = x)

We have
    x = x⊥ + r · w/‖w‖    (4.6)
where w/‖w‖ is the unit vector in the direction of w, x⊥ is the orthogonal projection of x onto the line y(x) = 0, and r is the (signed) distance of x to the line.
Multiply (4.6) left and right by w^T and add b:
    w^T x + b = (w^T x⊥ + b) + r · w^T w/‖w‖
    y(x) = 0 + r ‖w‖
So we get
    r = y(x)/‖w‖    (4.7)
Distance of a point to a line

The signed distance of x_n to the decision boundary is
    r = y(x_n)/‖w‖
For lines that separate the data perfectly, we have t_n y(x_n) = |y(x_n)|, so that the distance is given by
    t_n y(x_n)/‖w‖ = t_n (w^T φ(x_n) + b)/‖w‖    (7.2)
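The distance formula (7.2) can be checked numerically. A minimal Python sketch, assuming φ(x) = x; the separator and the data point are chosen purely for illustration:

```python
import numpy as np

def signed_distance(w, b, x, t):
    """Distance (7.2): t * y(x) / ||w||, assuming phi(x) = x."""
    y = w @ x + b
    return t * y / np.linalg.norm(w)

# Illustrative separator x1 + x2 - 6 = 0 and a correctly classified point:
w, b = np.array([1.0, 1.0]), -6.0
print(signed_distance(w, b, np.array([4.0, 4.0]), +1))  # 2/sqrt(2) ~ 1.414
```

A negative result would indicate a misclassified point, since then t_n and y(x_n) disagree in sign.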
Maximum margin solution

Solve
    arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }    (7.3)
Since 1/‖w‖ does not depend on n, it can be moved outside of the minimization.
Direct solution of this problem would be rather complex. A more convenient representation is possible.
Canonical Representation

The hyperplane (decision boundary) is defined by
    w^T φ(x) + b = 0
Then also
    κ(w^T φ(x) + b) = κw^T φ(x) + κb = 0
so rescaling w → κw and b → κb gives just another representation of the same decision boundary.
Choose the scaling factor such that
    t_i (w^T φ(x_i) + b) = 1    (7.4)
for the point x_i closest to the decision boundary.
Canonical Representation (square = 1, circle = −1)
(Figure: margin lines y(x) = −1, y(x) = 0, y(x) = 1.)
Canonical Representation

In this case we have the constraints
    t_n (w^T φ(x_n) + b) ≥ 1,    n = 1, ..., N    (7.5)
Quadratic program
    arg min_{w,b} (1/2)‖w‖²    (7.6)
subject to the constraints (7.5).
This optimization problem has a unique global minimum.
Lagrangian Function

Introduce Lagrange multipliers a_n ≥ 0 to get the Lagrangian function
    L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }    (7.7)
with
    ∂L(w, b, a)/∂w = w − Σ_{n=1}^N a_n t_n φ(x_n)
Lagrangian Function

and for b:
    ∂L(w, b, a)/∂b = − Σ_{n=1}^N a_n t_n
Equating the derivatives to zero yields the conditions:
    w = Σ_{n=1}^N a_n t_n φ(x_n)    (7.8)
and
    Σ_{n=1}^N a_n t_n = 0    (7.9)
Dual Representation

Eliminating w and b from L(w, b, a) gives the dual representation.
    L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }
               = (1/2)‖w‖² − Σ_{n=1}^N a_n t_n w^T φ(x_n) − b Σ_{n=1}^N a_n t_n + Σ_{n=1}^N a_n
               = (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)^T φ(x_m) − Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)^T φ(x_m) + Σ_{n=1}^N a_n
               = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)^T φ(x_m)
using (7.8) to substitute for w and (7.9) to drop the term in b.
Dual Representation

Maximize
    L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n a_m t_n t_m φ(x_n)^T φ(x_m)    (7.10)
with respect to a and subject to the constraints
    a_n ≥ 0,    n = 1, ..., N    (7.11)
    Σ_{n=1}^N a_n t_n = 0.    (7.12)
Kernel Function

We map to a high-dimensional space φ(x) in which the data is linearly separable.
Performing computations in this high-dimensional space may be very expensive.
Use a kernel function k that computes a dot product in this space (without making the actual mapping):
    k(x, x′) = φ(x)^T φ(x′)
Example: polynomial kernel

Suppose x ∈ IR³ and φ(x) ∈ IR¹⁰ with
    φ(x) = (1, √2 x_1, √2 x_2, √2 x_3, x_1², x_2², x_3², √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3)
Then
    φ(x)^T φ(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + 2x_3 z_3 + x_1² z_1² + x_2² z_2² + x_3² z_3² + 2x_1 x_2 z_1 z_2 + 2x_1 x_3 z_1 z_3 + 2x_2 x_3 z_2 z_3
But this can be written as
    (1 + x^T z)² = (1 + x_1 z_1 + x_2 z_2 + x_3 z_3)²
which costs far fewer operations to compute.
Polynomial kernel: numeric example

Suppose x = (3, 2, 6) and z = (4, 1, 5). Then
    φ(x) = (1, 3√2, 2√2, 6√2, 9, 4, 36, 6√2, 18√2, 12√2)
    φ(z) = (1, 4√2, √2, 5√2, 16, 1, 25, 4√2, 20√2, 5√2)
Then
    φ(x)^T φ(z) = 1 + 24 + 4 + 60 + 144 + 4 + 900 + 48 + 720 + 120 = 2025.
But
    (1 + x^T z)² = (1 + (3)(4) + (2)(1) + (6)(5))² = 45² = 2025
is a more efficient way to compute this dot product.
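The numeric example can be verified directly. A quick Python check that the explicit 10-dimensional dot product and the kernel trick agree:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel (slide example)."""
    x1, x2, x3 = x
    s = np.sqrt(2)
    return np.array([1, s*x1, s*x2, s*x3, x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([3.0, 2.0, 6.0])
z = np.array([4.0, 1.0, 5.0])

explicit = phi(x) @ phi(z)       # 10-dimensional dot product
kernel   = (1 + x @ z) ** 2      # kernel trick: (1 + x.z)^2
print(explicit, kernel)          # both 2025.0
```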
Kernels

Linear kernel
    k(x, x′) = x^T x′
Two popular non-linear kernels are the polynomial kernel
    k(x, x′) = (x^T x′ + c)^M
and the Gaussian (or radial) kernel
    k(x, x′) = exp(−‖x − x′‖²/2σ²),    (6.23)
or
    k(x, x′) = exp(−γ‖x − x′‖²),
where γ = 1/(2σ²).
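A minimal sketch of the three kernels in Python; the parameter defaults (c = 1, M = 2, γ = 0.5) are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, c=1.0, M=2):
    return (x @ xp + c) ** M

def gaussian_kernel(x, xp, gamma=0.5):
    # gamma = 1/(2*sigma^2), as in the slide
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x  = np.array([1.0, 0.0])
xp = np.array([0.0, 1.0])
print(linear_kernel(x, xp))      # 0.0
print(polynomial_kernel(x, xp))  # 1.0
print(gaussian_kernel(x, xp))    # exp(-1) ~ 0.368
```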
Dual Representation with kernels

Using k(x, x′) = φ(x)^T φ(x′) we get the dual representation: Maximize
    L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n a_m t_n t_m k(x_n, x_m)    (7.10)
with respect to a and subject to the constraints
    a_n ≥ 0,    n = 1, ..., N    (7.11)
    Σ_{n=1}^N a_n t_n = 0.    (7.12)
Is this dual easier than the original problem?
Prediction

Recall that
    y(x) = w^T φ(x) + b    (7.1)
Substituting
    w = Σ_{n=1}^N a_n t_n φ(x_n)    (7.8)
into (7.1), we get
    y(x) = b + Σ_{n=1}^N a_n t_n k(x, x_n)    (7.13)
Prediction: support vectors

KKT conditions:
    a_n ≥ 0    (7.14)
    t_n y(x_n) − 1 ≥ 0    (7.15)
    a_n { t_n y(x_n) − 1 } = 0    (7.16)
From (7.16) it follows that for every data point, either
1. a_n = 0, or
2. t_n y(x_n) = 1.
The former play no role in making predictions (see 7.13), and the latter are the support vectors, which lie on the maximum margin hyperplanes.
Only the support vectors play a role in predicting the class of new attribute vectors!
Prediction: computing b

Since for any support vector x_n we have t_n y(x_n) = 1, we can use (7.13) to get
    t_n ( b + Σ_{m∈S} a_m t_m k(x_n, x_m) ) = 1,    (7.17)
where S denotes the set of support vectors. Hence we have
    t_n b + t_n Σ_{m∈S} a_m t_m k(x_n, x_m) = 1
    t_n b = 1 − t_n Σ_{m∈S} a_m t_m k(x_n, x_m)
    b = t_n − Σ_{m∈S} a_m t_m k(x_n, x_m)    (7.17a)
since t_n ∈ {−1, +1} and so 1/t_n = t_n.
Prediction: computing b

A numerically more stable solution is obtained by averaging (7.17a) over all support vectors:
    b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )    (7.18)
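Equation (7.18) can be sketched as a small Python function; `kernel` is any function k(x, x′), and the data and dual coefficients below are illustrative values consistent with the constraints (7.11)–(7.12):

```python
import numpy as np

def intercept(X, t, a, kernel):
    """b averaged over the support vectors, eq. (7.18)."""
    S = np.flatnonzero(a > 0)                 # support vector indices
    return np.mean([t[n] - sum(a[m] * t[m] * kernel(X[n], X[m]) for m in S)
                    for n in S])

# Illustrative linearly separable data with a consistent dual solution:
X = np.array([[2, 2], [1, 3], [3, 1], [3, 6], [4, 4], [6, 5]], dtype=float)
t = np.array([-1, -1, -1, 1, 1, 1], dtype=float)
a = np.array([0, 1/8, 1/8, 0, 1/4, 0])
print(intercept(X, t, a, np.dot))             # -3.0
```

For a perfectly separable problem every support vector yields the same b, so the averaging only matters once numerical error enters.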
Prediction: Example

We receive the following output from the optimization software for fitting a support vector machine with linear kernel and perfect separation of the training data:

    n    x_n,1   x_n,2   t_n    a_n
    1    2       2       −1     0
    2    1       3       −1     1/8
    3    3       1       −1     1/8
    4    3       6       +1     0
    5    4       4       +1     1/4
    6    6       5       +1     0
Prediction: Example

The figure below is a plot of the same data set, where the dots represent points with class −1, and the crosses points with class +1.
(figure)
Prediction: Example

(a) Compute the value of the SVM bias term b.
Data points with a_n > 0 are support vectors. Let's take the point x_1 = 4, x_2 = 4 with class label +1:
    b = t_m − Σ_{n=1}^N a_n t_n x_n^T x_m
      = 1 − ( −(1/8)[4 4][1 3]^T − (1/8)[4 4][3 1]^T + (1/4)[4 4][4 4]^T )
      = 1 − (−2 − 2 + 8) = −3
(b) Which class does the SVM predict for the data point x_1 = 5, x_2 = 2?
    y(x) = b + Σ_{n=1}^N a_n t_n x_n^T x
         = −3 − (1/8)[5 2][1 3]^T − (1/8)[5 2][3 1]^T + (1/4)[5 2][4 4]^T
         = −3 − 11/8 − 17/8 + 7 = 1/2
Since the sign is positive, we predict class +1.
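The two computations of this example can be reproduced in a few lines of Python (linear kernel, so k(x, x′) = x·x′). The dual coefficients a = (0, 1/8, 1/8, 0, 1/4, 0) are the values implied by the KKT conditions for this data set:

```python
import numpy as np

# Data and dual solution of the worked example (linear kernel)
X = np.array([[2, 2], [1, 3], [3, 1], [3, 6], [4, 4], [6, 5]], dtype=float)
t = np.array([-1, -1, -1, +1, +1, +1], dtype=float)
a = np.array([0, 1/8, 1/8, 0, 1/4, 0])

# (a) bias: b = t_m - sum_n a_n t_n x_n . x_m, using support vector x_5 = (4,4)
m = 4
b = t[m] - sum(a[n] * t[n] * X[n] @ X[m] for n in range(6))
print(b)                   # -3.0

# (b) prediction for x = (5, 2)
x = np.array([5.0, 2.0])
y = b + sum(a[n] * t[n] * X[n] @ x for n in range(6))
print(y)                   # 0.5 -> sign positive, predict class +1
```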
Prediction: Example
(Figure: decision boundary and support vectors.)
Allowing Errors

So far we assumed that the training data points are linearly separable in feature space φ(x).
The resulting SVM gives exact separation of the training data in the original input space, with a non-linear decision boundary.
Class distributions typically overlap, in which case exact separation of the training data leads to poor generalization (overfitting).
Allowing Errors

Data points are allowed to be on the wrong side of the margin boundary, but with a penalty that increases with the distance from that boundary.
For convenience we make this penalty a linear function of the distance to the margin boundary.
Introduce slack variables ξ_n ≥ 0, with one slack variable for each training data point.
Definition of Slack Variables

We define ξ_n = 0 for data points that are on or inside the correct margin boundary, and ξ_n = |t_n − y(x_n)| for all other data points.
(Figure: points with ξ = 0 on or inside the correct margin boundary, 0 < ξ < 1 inside the margin but correctly classified, ξ > 1 misclassified; margin lines y(x) = −1, y(x) = 0, y(x) = 1.)
New Constraints

The exact classification constraints
    t_n y(x_n) ≥ 1,    n = 1, ..., N    (7.5)
are replaced by
    t_n y(x_n) ≥ 1 − ξ_n,    n = 1, ..., N    (7.20)
Check (7.20):
- ξ_n = 0 for data points that are on or inside the correct margin boundary. In that case y_n t_n ≥ 1.
- Suppose t_n = +1 and x_n is on the wrong side of the margin boundary, i.e. y_n t_n < 1. Since y_n = y_n t_n, we have
    ξ_n = |t_n − y_n| = 1 − y_n = 1 − y_n t_n
and therefore t_n y_n = 1 − ξ_n.
- The case t_n = −1 is analogous: ξ_n = |t_n − y_n| = 1 + y_n, so t_n y_n = −y_n = 1 − ξ_n.
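The case analysis above can be condensed into a tiny helper: in every case ξ_n works out to max(0, 1 − t_n y_n). A sketch with illustrative values:

```python
def slack(t_n, y_n):
    """Slack variable: 0 on/inside the correct margin boundary,
    |t_n - y_n| otherwise -- which equals max(0, 1 - t_n * y_n)."""
    xi = 0.0 if t_n * y_n >= 1 else abs(t_n - y_n)
    # the two formulas agree, as the case analysis on the slide shows:
    assert abs(xi - max(0.0, 1.0 - t_n * y_n)) < 1e-12
    return xi

print(slack(+1, 1.5))   # 0.0: on the correct side of the margin
print(slack(+1, 0.4))   # 0.6: inside the margin, still correctly classified
print(slack(-1, 0.5))   # 1.5: misclassified (xi > 1)
```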
New objective function

Our goal is to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize
    C Σ_{n=1}^N ξ_n + (1/2)‖w‖²    (7.21)
where the parameter C > 0 controls the trade-off between the slack variable penalty and the margin.
Alternative view (divide by C and put λ = 1/(2C)):
    Σ_{n=1}^N ξ_n + λ‖w‖²
The first term represents lack-of-fit (hinge loss) and the second term takes care of regularization.
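A sketch of objective (7.21), with ξ_n written as the hinge loss max(0, 1 − t_n y_n); the data and the choice C = 1 are purely illustrative:

```python
import numpy as np

def soft_margin_objective(w, b, X, t, C):
    """C * sum(xi_n) + 0.5 * ||w||^2, eq. (7.21)."""
    y = X @ w + b
    xi = np.maximum(0.0, 1.0 - t * y)      # hinge loss per point
    return C * xi.sum() + 0.5 * (w @ w)

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
t = np.array([1.0, -1.0, 1.0])
w, b = np.array([1.0, 0.0]), 0.0
print(soft_margin_objective(w, b, X, t, C=1.0))   # 0.5 (slack) + 0.5 (margin term) = 1.0
```

Increasing C pushes the optimizer toward fitting the training points; decreasing it favors a wider margin.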
Optimization Problem

The Lagrangian is given by
    L(w, b, a) = (1/2)‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n { t_n y(x_n) − 1 + ξ_n } − Σ_{n=1}^N μ_n ξ_n    (7.22)
where a_n ≥ 0 and μ_n ≥ 0 are Lagrange multipliers. The KKT conditions are given by:
    a_n ≥ 0    (7.23)
    t_n y(x_n) − 1 + ξ_n ≥ 0    (7.24)
    a_n ( t_n y(x_n) − 1 + ξ_n ) = 0    (7.25)
    μ_n ≥ 0    (7.26)
    ξ_n ≥ 0    (7.27)
    μ_n ξ_n = 0    (7.28)
Dual

Take the derivatives with respect to w, b and ξ_n and equate to zero:
    ∂L/∂w = 0  ⇒  w = Σ_{n=1}^N a_n t_n φ(x_n)    (7.29)
    ∂L/∂b = 0  ⇒  Σ_{n=1}^N a_n t_n = 0    (7.30)
    ∂L/∂ξ_n = 0  ⇒  a_n = C − μ_n    (7.31)
Dual

Using these to eliminate w, b and ξ_n from the Lagrangian, we obtain the dual Lagrangian: Maximize
    L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n a_m t_n t_m k(x_n, x_m)    (7.32)
with respect to a and subject to the constraints
    0 ≤ a_n ≤ C,    n = 1, ..., N    (7.33)
    Σ_{n=1}^N a_n t_n = 0.    (7.34)
Note: we have a_n ≤ C since μ_n ≥ 0 (7.26) and a_n = C − μ_n (7.31).
Prediction

Recall that
    y(x) = w^T φ(x) + b    (7.1)
Substituting
    w = Σ_{n=1}^N a_n t_n φ(x_n)    (7.8)
into (7.1), we get
    y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b    (7.13)
with k(x, x_n) = φ(x)^T φ(x_n).
Interpretation of Solution

We distinguish two cases:
- Points with a_n = 0 do not play a role in making predictions.
- Points with a_n > 0 are called support vectors. It follows from KKT condition
      a_n ( t_n y(x_n) − 1 + ξ_n ) = 0    (7.25)
  that for these points t_n y_n = 1 − ξ_n.
Again we have two cases:
- If a_n < C then μ_n > 0, because a_n = C − μ_n. Since μ_n ξ_n = 0 (7.28), it follows that ξ_n = 0 and hence such points lie on the margin.
- Points with a_n = C can lie on the margin or inside the margin, and can either be correctly classified if ξ_n ≤ 1 or misclassified if ξ_n > 1.
Computing the intercept

To compute the value of b, we use the fact that those support vectors with 0 < a_n < C have ξ_n = 0, so that t_n y(x_n) = 1, and like before we have
    b = t_n − Σ_{m∈S} a_m t_m k(x_n, x_m)    (7.17a)
Again a numerically more stable solution is obtained by averaging (7.17a) over all data points having 0 < a_n < C:
    b = (1/N_M) Σ_{n∈M} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )    (7.37)
Model Selection

As usual we are confronted with the problem of selecting the appropriate model complexity.
The relevant parameters are C and any parameters of the chosen kernel function.
How to in R

> conn.svm.lin <- svm(cause ~ sodium + co2, data=conn.dat, kernel="linear")
> plot(conn.svm.lin, conn.dat)
> conn.svm.lin.predict <- predict(conn.svm.lin, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.lin.predict)
   conn.svm.lin.predict
     0  1
  0 17  3
  1  2  8
Conn's syndrome: linear kernel
(Figure: SVM classification plot, sodium vs. co2.)
How to in R

> conn.svm.rad <- svm(cause ~ sodium + co2, data=conn.dat)
> plot(conn.svm.rad, conn.dat)
> conn.svm.rad.predict <- predict(conn.svm.rad, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.rad.predict)
   conn.svm.rad.predict
     0  1
  0 17  3
  1  2  8
Conn's syndrome: radial kernel, C = 1
(Figure: SVM classification plot, sodium vs. co2.)
How to in R

> conn.svm.rad <- svm(cause ~ sodium + co2, data=conn.dat, cost=100)
> plot(conn.svm.rad, conn.dat)
> conn.svm.rad.predict <- predict(conn.svm.rad, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.rad.predict)
   conn.svm.rad.predict
     0  1
  0 19  1
  1  1  9
Conn's syndrome: radial kernel, C = 100
(Figure: SVM classification plot, sodium vs. co2.)
SVM in R

LIBSVM is available in package e1071 in R. It can also perform regression and non-binary classification.
Non-binary classification is performed as follows:
- Train K(K−1)/2 binary SVMs on all possible pairs of classes.
- To classify a new point, let it be classified by every binary SVM, and pick the class with the highest number of votes.
This is done automatically by function svm in e1071.
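The one-vs-one voting scheme can be sketched in a few lines of Python; here `binary_predict` stands in for the K(K−1)/2 trained pairwise SVMs and the `demo` rule is purely hypothetical:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_predict):
    """Let every pairwise classifier vote; the class with most votes wins."""
    votes = Counter(binary_predict(x, i, j) for i, j in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy stand-in for the pairwise SVMs: class j wins the (i, j) duel iff j <= x.
demo = lambda x, i, j: j if j <= x else i
print(one_vs_one_predict(2, [0, 1, 2, 3], demo))   # class 2 collects most votes
```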