Support Vector Machines and Other Kernel Methods. Kristin P. Bennett, Mathematical Sciences Department, Rensselaer Polytechnic Institute
Support Vector Machines (SVM): a methodology for inference based on the Statistical Learning Theory of Vapnik. Three key ideas: capacity control (maximize margins for classification), duality, kernels.
Outline: Intuitive guide to SVM classification. Kernel method case studies: Support Vector Regression, Kernel Principal Components Analysis. Kernels for different kinds of data. Practical considerations. Hype or Hallelujah?
Binary Classification Example: Medical Diagnosis. Is it benign or malignant? f(z) = L(y, g(x)) ≥ 0, where g is the prediction function, L is the loss function, and y ∈ {−1, 1}.
Linear Classification Model. Given training data {(x_1, y_1), ..., (x_ℓ, y_ℓ)}, x_i ∈ R^n, y_i ∈ {−1, 1}. Linear model: find w ∈ R^n, b ∈ R such that y_i = sign(w'x_i − b).
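A minimal sketch of this decision rule in Python; the weight vector w and threshold b below are illustrative values, not learned parameters.

```python
def sign(t):
    return 1 if t >= 0 else -1

def predict(w, b, x):
    # Linear decision rule: sign of the inner product w'x minus threshold b.
    return sign(sum(wi * xi for wi, xi in zip(w, x)) - b)

w, b = [1.0, -1.0], 0.0           # hypothetical model parameters
print(predict(w, b, [2.0, 1.0]))  # w'x - b = 1  -> +1
print(predict(w, b, [1.0, 2.0]))  # w'x - b = -1 -> -1
```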
Intuitive Linear Classification. [Figure: training points labeled Yes and No, separated by a line.]
Predict New Point? [Figure: a new point of unknown class (Yes? No?) among the labeled training points.]
Best Linear Separator? [Series of figures showing several candidate separating lines for the same data.]
Find Closest Points in Convex Hulls. [Figure: the convex hulls of the two classes, with closest points c and d.]
Plane Bisects Closest Points: the separating plane is x'w = b with normal w = d − c.
Find c and d using a quadratic program:
min_α (1/2)‖c − d‖²  s.t.  c = Σ_{y_i=1} α_i x_i,  d = Σ_{y_i=−1} α_i x_i,  Σ_{y_i=1} α_i = 1,  Σ_{y_i=−1} α_i = 1,  α_i ≥ 0,  i = 1, ..., ℓ.
Many existing and new QP solvers can be used.
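A brute-force illustration of this objective (not a real QP solver): with only two points per class, each convex hull is a line segment, and a coarse grid search over the two convex-combination weights stands in for the quadratic program. The data points are made up.

```python
# Two made-up classes; each convex hull is a line segment.
class_pos = [(0.0, 1.0), (1.0, 1.0)]    # Class +1 hull: segment at height +1
class_neg = [(0.0, -1.0), (1.0, -1.0)]  # Class -1 hull: segment at height -1

def combo(points, a):
    # Convex combination a*p0 + (1-a)*p1 of the segment's two endpoints.
    (x0, y0), (x1, y1) = points
    return (a * x0 + (1 - a) * x1, a * y0 + (1 - a) * y1)

def closest_distance(steps=100):
    # Grid search over the two alphas in place of the quadratic program.
    best = float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            c = combo(class_pos, i / steps)
            d = combo(class_neg, j / steps)
            best = min(best, ((c[0] - d[0])**2 + (c[1] - d[1])**2) ** 0.5)
    return best

print(closest_distance())  # parallel segments 2 apart -> 2.0
```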
Best Linear Separator: Supporting Plane Method. Maximize the distance between two supporting planes x'w = b + 1 and x'w = b − 1. Distance = Margin = 2/‖w‖.
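The margin formula on this slide, 2/‖w‖, as a one-line helper; the w passed in is just an example vector.

```python
import math

def margin(w):
    # Distance between the supporting planes x'w = b+1 and x'w = b-1.
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(margin([3.0, 4.0]))  # ||w|| = 5, so the margin is 0.4
```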
Maximize the margin using a quadratic program:
min_{w,b} ‖w‖²/2  s.t.  x_i·w − b ≥ +1 for x_i in Class +1,  x_i·w − b ≤ −1 for x_i in Class −1.
Dual of the Closest Points Method is the Supporting Plane Method:
min_α (1/2)‖Σ_i y_i α_i x_i‖²  s.t.  Σ_{y_i=1} α_i = 1,  Σ_{y_i=−1} α_i = 1,  α_i ≥ 0   ⇔   min_{w,b} (1/2)‖w‖²  s.t.  y_i(x_i·w − b) ≥ 1,  i = 1, ..., ℓ.
The solution depends only on the support vectors, those with α_i > 0:  w = Σ_i y_i α_i x_i.
Statistical Learning Theory: misclassification error and the complexity of the function class bound the generalization error. Maximizing margins minimizes complexity, which "eliminates" overfitting. The solution depends only on the support vectors, not on the number of attributes.
Margins and Generalization: a skinny margin has more capacity to fit the data, and thus many more ways to be unlucky.
Margins and Generalization: a fat margin has less capacity to fit the data, and thus won't be as likely to overfit.
One bad example? The convex hulls now intersect, so the same argument won't work.
Don't trust a single point! Each point must depend on at least two actual data points.
Depend on >= Two Points: each point must depend on at least two actual data points. [Series of figures: the convex hulls shrink as each vertex is replaced by a combination of at least two data points.]
Final Reduced/Robust Set: each point depends on at least two actual data points. This set is called the Reduced Convex Hull.
Reduced Convex Hulls Don't Intersect: shrink each hull by putting an upper bound D on the multipliers, e.g. d = Σ_{y_i=−1} α_i x_i,  Σ α_i = 1,  0 ≤ α_i ≤ D,  with D = 1/2.
Find Closest Points, Then Bisect:
min_α (1/2)‖Σ_{y_i=1} α_i x_i − Σ_{y_i=−1} α_i x_i‖²  s.t.  Σ_{y_i=1} α_i = 1,  Σ_{y_i=−1} α_i = 1,  0 ≤ α_i ≤ D.
No change except for D. D determines the number of support vectors.
Linearly Inseparable Case: Soft Margin Method. Just add a non-negative error vector z:
min_{w,b,z} ‖w‖²/2 + C Σ_i z_i  s.t.  y_i(x_i·w − b) + z_i ≥ 1,  z_i ≥ 0,  i = 1, ..., ℓ.
Dual of the Closest Points Method is the Soft Margin Method:
min_α (1/2)‖Σ_i y_i α_i x_i‖²  s.t.  Σ_{y_i=1} α_i = 1,  Σ_{y_i=−1} α_i = 1,  0 ≤ α_i ≤ D   ⇔   min_{w,b,z} (1/2)‖w‖² + C Σ_i z_i  s.t.  y_i(x_i·w − b) + z_i ≥ 1,  z_i ≥ 0,  i = 1, ..., ℓ.
The solution depends only on the support vectors, those with α_i > 0:  w = Σ_i y_i α_i x_i.
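The soft-margin primal can also be minimized directly by stochastic subgradient descent on its unconstrained form (1/2)‖w‖² + C Σ max(0, 1 − y_i(x_i·w − b)). The sketch below does this on a made-up toy set; it is an alternative to the QP on the slide, not the method the slide describes, and the learning rate, epoch count, and data are all assumptions.

```python
import random

# Toy training set (made up for illustration): two classes in the plane.
data = [([2.0, 2.0], 1), ([3.0, 1.0], 1),
        ([-2.0, -1.0], -1), ([-1.0, -3.0], -1)]

def train(points, C=1.0, lr=0.01, epochs=500, seed=0):
    # Stochastic subgradient descent on (1/2)||w||^2 + C * sum of hinges.
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        order = points[:]
        rng.shuffle(order)
        for x, y in order:
            f = w[0] * x[0] + w[1] * x[1] - b
            if y * f < 1:
                # Margin violated: the hinge term is active, so its
                # subgradient (-C*y*x for w, C*y for b) joins the
                # regularizer's gradient w.
                w = [wi - lr * (wi - C * y * xi) for wi, xi in zip(w, x)]
                b -= lr * C * y
            else:
                w = [wi - lr * wi for wi in w]  # only the regularizer acts
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] - b >= 0 else -1

w, b = train(data)
print(all(predict(w, b, x) == y for x, y in data))
```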
Nonlinear Classification in Feature Space. Example: for x = [a, b], map θ(x) = (a², b², √2·ab), so
g(x) = θ(x)·w = w₁a² + w₂b² + w₃√2·ab.
Nonlinear Classification: Map to a Higher Dimensional Space. IDEA: map each point to a higher dimensional feature space and construct a linear discriminant there. Define θ: R^n → R^{n'}, n' >> n. Dual SVM:
min_α (1/2) Σ_i Σ_j y_i y_j α_i α_j θ(x_i)·θ(x_j) − Σ_i α_i  s.t.  Σ_i y_i α_i = 0,  0 ≤ α_i ≤ C,  i = 1, ..., ℓ.
Kernel Calculates Inner Product. With u = [u₁, u₂] and φ(u) = (u₁², u₂², √2·u₁u₂):
φ(u)·φ(v) = u₁²v₁² + u₂²v₂² + 2u₁u₂v₁v₂ = (u₁v₁ + u₂v₂)² = ⟨u, v⟩².
Thus K(u, v) = ⟨u, v⟩².
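The identity above can be checked numerically: with φ(u) = (u₁², u₂², √2·u₁u₂), the feature-space inner product φ(u)·φ(v) equals the kernel value ⟨u, v⟩² without ever forming φ.

```python
import math

def phi(u):
    # Explicit degree-2 feature map from the slide.
    return (u[0] ** 2, u[1] ** 2, math.sqrt(2) * u[0] * u[1])

def k(u, v):
    # Kernel: squared inner product, computed directly in input space.
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (1.0, 2.0), (3.0, 4.0)
lhs = sum(a * b for a, b in zip(phi(u), phi(v)))
print(lhs, k(u, v))  # both equal (1*3 + 2*4)^2 = 121
```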
Final Classification via Kernels. The Dual SVM:
min_α (1/2) Σ_i Σ_j y_i y_j α_i α_j K(x_i, x_j) − Σ_i α_i  s.t.  Σ_i y_i α_i = 0,  0 ≤ α_i ≤ C,  i = 1, ..., ℓ.
Generalized Inner Product. By Hilbert-Schmidt Kernels (Courant and Hilbert), for certain θ and K, θ(u)·θ(v) = K(u, v). Examples:
Degree d polynomial: (u·v + 1)^d
Radial Basis Function Machine: exp(−‖u − v‖² / (2σ²))
Two Layer Neural Network: sigmoid(η(u·v) + c)
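The three kernels listed above, sketched as plain functions. The hyperparameters d, σ, η, and c are chosen here only for illustration, and the "sigmoid" is implemented with tanh, the usual choice for this kernel.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, d=2):
    # Degree-d polynomial: (u.v + 1)^d
    return (dot(u, v) + 1) ** d

def rbf_kernel(u, v, sigma=1.0):
    # Radial basis function: exp(-||u - v||^2 / (2 sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta=1.0, c=0.0):
    # Two-layer neural network kernel, with tanh as the sigmoid.
    return math.tanh(eta * dot(u, v) + c)

u, v = (1.0, 0.0), (0.0, 1.0)
print(poly_kernel(u, v), rbf_kernel(u, v))  # (0+1)^2 = 1, exp(-1) ~ 0.368
```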
Final SVM Algorithm: Solve the Dual SVM QP. Recover the primal variable b. Classify a new x: f(x) = sign(Σ_i α_i y_i K(x_i, x) − b). The solution depends only on the support vectors, those with α_i > 0.
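The classification rule above, sketched with a hypothetical set of support vectors and multipliers; the alphas and b here are made up for illustration, not the output of a solved QP, and a linear kernel is used for simplicity.

```python
def k(u, v):
    # Linear kernel for simplicity; any Mercer kernel could be substituted.
    return sum(a * b for a, b in zip(u, v))

def classify(support, alphas, ys, b, x):
    # f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) - b )
    s = sum(a * y * k(sv, x) for sv, a, y in zip(support, alphas, ys))
    return 1 if s - b >= 0 else -1

support = [(1.0, 1.0), (-1.0, -1.0)]  # hypothetical support vectors
alphas = [0.5, 0.5]                   # hypothetical multipliers
ys = [1, -1]
b = 0.0
print(classify(support, alphas, ys, b, (2.0, 0.5)))   # +1 side
print(classify(support, alphas, ys, b, (-2.0, -0.5))) # -1 side
```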
Support Vector Machines (SVM). Key formulation ideas: capacity control by maximizing margins, duality, kernels. Generalization error bounds. Few parameters to tune. Practical algorithms.
Kernel Methods: General Methodology. Pick a loss function and apply it to a linear function f, e.g. the hinge loss: loss(x, y) = max(1 − y·f(x), 0). Pick capacity control/regularization: ‖w‖². Formulate the primal. Construct the dual. Kernelize. Apply standard algorithms. You can do this for most loss functions. After the BREAK we demonstrate this for regression and PCA.
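The hinge loss named in the recipe above, as a small helper: it is zero for points classified with margin at least 1 and grows linearly otherwise.

```python
def hinge(y, fx):
    # Hinge loss max(1 - y*f(x), 0): zero outside the margin, linear inside.
    return max(1 - y * fx, 0)

print(hinge(1, 2.0))   # 0   -> confidently correct
print(hinge(1, 0.5))   # 0.5 -> correct but inside the margin
print(hinge(-1, 0.5))  # 1.5 -> wrong side of the boundary
```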
Cortes and Vapnik, Figure: degree 2 polynomial classifier; support vectors shown as circles, errors marked.
Fig 6: US postal data, 7.3K train, 2K test (16 by 16 pixel images).
Results on US postal service data: [table not recovered].
Errors on US postal service data. [Figure: misclassified digits.]
NIST data: 60K train, 10K test, 28×28 pixel images, degree 4 polynomial. [Figure: misclassified examples; some are false negatives, the others false positives.]