Which Separator?
6.034 - Spring
Which Separator?
Maximize the margin to the closest points.
Margin of a Point
γ_i = y_i (w·x_i + b), proportional to the perpendicular distance of the point to the hyperplane.
The geometric margin is γ_i / ‖w‖.
Margin
γ_i = y_i (w·x_i + b)
Scaling w changes the value of the margin but not the actual distances to the separator (the geometric margin).
Pick the margin to the closest positive and negative points to be 1:
+1 (w·x⁺ + b) = +1
−1 (w·x⁻ + b) = +1
Margin
Pick the margin to the closest positive and negative points to be 1:
+1 (w·x⁺ + b) = +1
−1 (w·x⁻ + b) = +1
Combining these:
w·(x⁺ − x⁻) = 2
Dividing by the length of w gives the perpendicular distance between the lines (the geometric margin):
(w/‖w‖)·(x⁺ − x⁻) = 2/‖w‖
Picking w to Maximize the Margin
Pick w to maximize the geometric margin 2/‖w‖,
or, equivalently, minimize ‖w‖/2,
or, equivalently, minimize (1/2)‖w‖² = (1/2) w·w = (1/2) Σ_j w_j²,
while classifying the points correctly:
y_i (w·x_i + b) ≥ 1, or equivalently, y_i (w·x_i + b) − 1 ≥ 0
Constrained Optimization
min_w (1/2)‖w‖²  subject to  y_i (w·x_i + b) − 1 ≥ 0, ∀i
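As a sanity check on these definitions, a small pure-Python sketch; the separator (w, b) and the two points are made-up illustrative values, not from the lecture:

```python
import math

def functional_margin(w, b, x, y):
    """y_i (w . x_i + b): the margin of point x with label y."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def geometric_margin_width(w):
    """2 / ||w||: perpendicular distance between the two margin lines."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Hypothetical separator 2*x1 - 3 = 0 and the closest point of each class.
w, b = (2.0, 0.0), -3.0
points = [((2.0, 0.0), +1), ((1.0, 0.0), -1)]

# Both closest points sit exactly on the margin lines: y_i (w.x_i + b) = 1 ...
for x, y in points:
    print(functional_margin(w, b, x, y))   # 1.0 for both

# ... and the geometric margin 2/||w|| is the distance between those lines.
print(geometric_margin_width(w))           # 1.0
```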
Constrained Optimization
[Figure: minimizing a parabola. With no constraint, the minimum is at x* = 0; with the constraint x ≥ −1 it is still x* = 0; with the constraint x ≥ 1 it moves to x* = 1.]
How do we solve with constraints? → Lagrange multipliers!
Lagrange Multipliers
Add a Lagrange multiplier α (a dual variable) for the constraint, and introduce the Lagrangian as the new objective. For a constraint x ≥ b, rewrite it as x − b ≥ 0 and solve:
min_x max_{α ≥ 0} L(x, α) = f(x) − α(x − b)
Why does this work at all? min is fighting max:
- x < b: (x − b) < 0, so max_α −α(x − b) = ∞; min won't let that happen!
- x > b: (x − b) > 0, so with the added constraint α ≥ 0, max_α −α(x − b) = 0 at α* = 0; min is cool with 0, and L(x, α) = (original objective).
- x = b: α can be anything, and L(x, α) = (original objective).
Since min is on the outside, it can force max to behave, and the constraints will be satisfied!
Constrained Optimization
min_w (1/2)‖w‖²  subject to  y_i (w·x_i + b) − 1 ≥ 0, ∀i
Convert to an unconstrained optimization by incorporating the constraints as an additional term:
min_{w,b} { (1/2)‖w‖² − Σ_i α_i [y_i (w·x_i + b) − 1] },  α_i ≥ 0, ∀i
The α_i are Lagrange multipliers; this is the method of Lagrange multipliers.
To minimize the expression: minimize the first (original) term, and maximize the second (constraint) term. Since α_i ≥ 0, this encourages the constraints to be satisfied, but with the least distortion of the original term.
Maximizing the Margin
L(w, b) = (1/2)‖w‖² − Σ_i α_i [y_i (w·x_i + b) − 1]
Minimized when:
w* = Σ_i α_i y_i x_i
Σ_i α_i y_i = 0
Substituting w* into L yields the dual Lagrangian:
L(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{k=1}^m α_i α_k y_i y_k (x_i·x_k)
Only dot products of the feature vectors appear.
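The dual objective is easy to evaluate directly from dot products. A minimal pure-Python sketch; the toy data and the α values are illustrative, not from the slides:

```python
def dual_objective(alpha, X, y):
    """L(alpha) = sum_i alpha_i
                  - 1/2 sum_i sum_k alpha_i alpha_k y_i y_k (x_i . x_k)"""
    m = len(X)
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    quad = sum(alpha[i] * alpha[k] * y[i] * y[k] * dot(X[i], X[k])
               for i in range(m) for k in range(m))
    return sum(alpha) - 0.5 * quad

# Toy 2D data, one point per class (illustrative values).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [+1, -1]

# Note that only the dot products x_i . x_k enter the computation.
print(dual_objective([0.25, 0.25], X, y))   # 0.25
```

For this toy pair, α = [0.25, 0.25] satisfies Σ_i α_i y_i = 0 and is in fact optimal, so L(α) equals the primal optimum (1/2)‖w*‖² = 0.25, as duality predicts.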
Dual Lagrangian
max_α L(α)  subject to  Σ_i α_i y_i = 0  and  α_i ≥ 0, ∀i
In general, since α_i ≥ 0, either
- α_i = 0: the constraint is satisfied with no distortion at the optimum w, or
- α_i > 0: the constraint is satisfied with equality (in this case x_i is known as a support vector).
w* = Σ_i α_i y_i x_i
b* = 1/y_i − w*·x_i  (for any support vector x_i)
L(α) has a unique maximum, which can be found using quadratic programming or gradient ascent.
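As a concrete illustration of the gradient-ascent option, here is a minimal pure-Python sketch on a made-up 2D dataset. For simplicity it drops the equality constraint Σ_i α_i y_i = 0 by fixing b = 0 (a separator through the origin) and just projects each α_i back onto α_i ≥ 0 after every step; a real solver (quadratic programming, or SMO) handles both constraints:

```python
# Projected gradient ascent on the dual: maximize
#   L(alpha) = sum_i alpha_i - 1/2 sum_{i,k} alpha_i alpha_k y_i y_k (x_i . x_k)
# subject to alpha_i >= 0 (bias term omitted for simplicity).

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Made-up 2D data, linearly separable through the origin.
X = [(2.0, 0.0), (1.0, 1.0), (-2.0, 0.0), (-1.0, -1.0)]
y = [+1, +1, -1, -1]
m = len(X)

# Gram-style matrix Q_ik = y_i y_k (x_i . x_k).
Q = [[y[i] * y[k] * dot(X[i], X[k]) for k in range(m)] for i in range(m)]

alpha = [0.0] * m
eta = 0.05
for _ in range(500):
    grad = [1.0 - sum(Q[i][k] * alpha[k] for k in range(m)) for i in range(m)]
    # Ascent step, then project back onto the feasible set alpha_i >= 0.
    alpha = [max(0.0, alpha[i] + eta * grad[i]) for i in range(m)]

# Recover w* = sum_i alpha_i y_i x_i and check the training predictions.
w = [sum(alpha[i] * y[i] * X[i][j] for i in range(m)) for j in range(2)]
preds = [1 if dot(w, x) > 0 else -1 for x in X]
print(preds)   # [1, 1, -1, -1], matching y
```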
SVM Classifier
Given an unknown vector u, predict the class (1 or −1) as follows:
h(u) = sign( Σ_{i=1}^k α_i y_i (x_i·u) + b )
The sum is over the k support vectors.
Bankruptcy Example
[Figure: separator for the bankruptcy data, with values such as 31.87 and −31.8 shown at the support vectors.]
α_i y_i for the support vectors are non-zero; all others are zero.
Key Points
- Learning depends only on dot products of sample pairs.
- Recognition depends only on dot products of the unknown with the samples.
- Exclusive reliance on dot products enables an approach to non-linearly-separable problems.
- The classifier depends only on the support vectors, not on all the training points.
- Max margin lowers hypothesis variance.
- The optimal classifier is defined uniquely: there are no local maxima in the search space.
- Polynomial in the number of data points and dimensionality.
Not Linearly Separable?
Require 0 ≤ α_i ≤ C
C is specified by the user; it controls the tradeoff between the size of the margin and classification errors.
C = ∞ for the separable case.
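In a simple dual solver, the soft-margin box constraint only changes the projection step: each α_i is clipped into [0, C] instead of just onto [0, ∞). A minimal sketch of that one change (the function name is illustrative):

```python
def project_box(alpha, C):
    """Project dual variables onto the soft-margin box 0 <= alpha_i <= C."""
    return [min(C, max(0.0, a)) for a in alpha]

# Values outside [0, C] get clipped to the nearest feasible point.
print(project_box([-0.2, 0.3, 5.0], C=1.0))   # [0.0, 0.3, 1.0]
```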
Changing C
[Figures: decision boundaries for C = 10 vs. C = 1, and for C = 100 vs. C = 1.]
Example: Linearly Separable
[Image by Patrick Winston]
Another Example: Not Linearly Separable
[Image by Patrick Winston]
Isn't a Linear Classifier Very Limiting?
[Figure: a 2D data set in which the positives are surrounded by negatives is not linearly separable in the original features, but becomes linearly separable using the squared values of the features.]
Important: a linear separator in the transformed feature space maps into a non-linear separator in the original feature space.
Not Separable? Try a Higher-Dimensional Space!
[Figure: data not separable with a 2D line is separable with a 3D plane.]
What You Need
To get into the new feature space, you use φ(x). The transformation can be to a higher-dimensional feature space and may be non-linear in the feature values.
Recall that SVMs only use dot products of the data, so:
- To optimize the classifier, you need φ(x_i)·φ(x_k)
- To run the classifier, you need φ(x_i)·φ(u)
So, all you need is a way to compute dot products in the transformed space as a function of vectors in the original space!
The Kernel Trick
If dot products can be efficiently computed by a function on low-dimensional inputs,
K(x_i, x_k) = φ(x_i)·φ(x_k),
then all you need is K(x_i, x_k); you never need to construct the high-dimensional φ(x).
Standard Choices for Kernels
No change (linear kernel):
K(x_i, x_k) = φ(x_i)·φ(x_k) = x_i·x_k
Polynomial kernel (n-th order):
K(x_i, x_k) = (1 + x_i·x_k)^n
Polynomial Kernel Example (one feature)
[Figure: points at x = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 on a line.]
Not separable.
Polynomial Kernel Example (one feature)
Not separable in the original feature x. Map it with
φ(x) = (x², √2 x, 1)
φ(x)·φ(z) = x²z² + 2xz + 1 = (1 + xz)²
[Figure: plotted in the (√2 x, x²) space, the positive and negative points become separable.]
Separable.
Polynomial Kernel
The polynomial kernel for n = 2 and two features x = [x₁ x₂],
K(x, z) = (1 + x·z)²,
is equivalent to the following feature mapping:
φ(x) = [x₁²  x₂²  √2 x₁x₂  √2 x₁  √2 x₂  1]
We can verify that:
φ(x)·φ(z) = x₁²z₁² + x₂²z₂² + 2x₁z₁x₂z₂ + 2x₁z₁ + 2x₂z₂ + 1
          = (1 + x₁z₁ + x₂z₂)²
          = (1 + x·z)²
          = K(x, z)
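The algebra above is easy to check numerically. A small pure-Python sketch, using arbitrary made-up test vectors:

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (1 + x . z)^2 for two features."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map equivalent to the n = 2 polynomial kernel."""
    r2 = math.sqrt(2.0)
    return [x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1], r2 * x[0], r2 * x[1], 1.0]

x, z = (0.3, -1.2), (2.0, 0.5)   # arbitrary test vectors
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))   # dot product in feature space
rhs = poly_kernel(x, z)                            # kernel in the original space
print(abs(lhs - rhs) < 1e-12)    # True: phi(x) . phi(z) == K(x, z)
```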
Polynomial Kernel
[Images by Patrick Winston]
Standard Choices for Kernels
No change (linear kernel):
K(x_i, x_k) = φ(x_i)·φ(x_k) = x_i·x_k
Polynomial kernel (n-th order):
K(x_i, x_k) = (1 + x_i·x_k)^n
Radial-basis kernel (σ is the standard deviation):
K(x_i, x_k) = e^(−‖x_i − x_k‖²/(2σ²)) = e^(−(x_i − x_k)·(x_i − x_k)/(2σ²))
Radial-Basis Kernel
The classifier is based on a sum of Gaussian bumps with standard deviation σ, centered on the support vectors:
h(u) = sign[ h̃(u) ]
h̃(u) = Σ_{i=1}^k α_i y_i K(x_i, u) + b
K(x_i, u) = e^(−‖x_i − u‖²/(2σ²))
Radial-Basis Kernel
σ = 0.1
[Figure: the resulting decision function for the one-feature example with points at x = 0.1, …, 0.6.]
Radial-Basis Kernel
σ = 0.2, b = 0.55
h̃(u) = Σ_{i=1}^4 α_i y_i K(x_i, u) + b,  K(x_i, u) = e^(−‖x_i − u‖²/(2σ²))
The four support vectors have y₁α₁ = 1.76, y₂α₂ = −1.76, y₃α₃ = 1.76, y₄α₄ = −1.76.
[Figure: h̃(u) plotted over the points at x = 0.1, …, 0.6, with the support vectors marked.]
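A minimal pure-Python sketch of this radial-basis classifier. The support vectors, multipliers, b, and σ below are made-up illustrative values, not the fitted ones from the figure:

```python
import math

def rbf_kernel(x, u, sigma):
    """K(x, u) = exp(-||x - u||^2 / (2 sigma^2)) for one feature."""
    return math.exp(-((x - u) ** 2) / (2.0 * sigma ** 2))

def h(u, svs, b, sigma):
    """sign of sum_i (alpha_i y_i) K(x_i, u) + b over the support vectors."""
    s = sum(ay * rbf_kernel(x, u, sigma) for x, ay in svs) + b
    return 1 if s > 0 else -1

# Illustrative support vectors as (x_i, alpha_i * y_i) pairs.
svs = [(0.2, +1.0), (0.4, -1.0)]
b, sigma = 0.0, 0.1

# Near each support vector, its own Gaussian bump dominates the sum.
print(h(0.2, svs, b, sigma))   # 1
print(h(0.4, svs, b, sigma))   # -1
```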
Radial-Basis Kernel (large σ)
[Images by Patrick Winston]
Another Radial-Basis Example (small σ)
[Image by Patrick Winston]
Cross-Validation Error
Does mapping to a very high-dimensional space lead to over-fitting?
Generally, no, thanks to the fact that only the support vectors determine the decision surface.
The expected leave-one-out cross-validation error depends on the number of support vectors, not on the dimensionality of the feature space:
Expected CV error ≤ (Expected # support vectors) / (# training samples)
If most data points are support vectors, that is a sign of possible overfitting, independent of the dimensionality of the feature space.
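The bound is just a ratio. For example, a hypothetical run with 30 support vectors out of 300 training samples (made-up counts, not from the lecture's data sets) bounds the expected leave-one-out error at 10%:

```python
def loo_error_bound(n_support_vectors, n_training_samples):
    """Expected leave-one-out CV error <= E[# support vectors] / # samples."""
    return n_support_vectors / n_training_samples

# Hypothetical counts for illustration.
print(loo_error_bound(30, 300))   # 0.1
```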
Summary
- A single global maximum, found by quadratic programming or gradient ascent.
- Few parameters: C and the kernel parameters (n for polynomial, σ for the radial-basis kernel).
- Kernel:
  - Quadratic minimization depends only on dot products of sample vectors.
  - Recognition depends only on dot products of the unknown vector with sample vectors.
  - Reliance on only dot products enables efficient feature mapping to higher-dimensional spaces, where linear separation is more effective.
Real Data
Wisconsin Breast Cancer Data
- 9 features
- C = 1
- 37 support vectors are used, from 512 training data points
- 12 prediction errors on the training set (98% accuracy)
- 96% accuracy on 171 held-out points
- Essentially the same performance as nearest neighbors and decision trees
Don't expect such good performance on every data set.
Success Stories
- Gene microarray data: outperformed all other classifiers; specially designed kernel.
- Text categorization: linear kernel in a >10,000-D input space; best prediction performance; 35 times faster to train than the next best classifier (decision trees).
- Many others: http://www.clopinet.com/isabelle/projects/svm/applist.html