Chapter 10 The Support-Vector-Machine (SVM): a statistical learning-theory approach to designing an optimal classifier

1 Chapter 10 The Support-Vector-Machine (SVM): a statistical learning-theory approach to designing an optimal classifier

2 Contents:
1. Problem
2. VC dimension and minimization of the overall error
3. Linear SVM: separable classes (example); non-separable classes (example)
4. Non-linear SVM: the trick with the kernel function (example); Mercer's theorem
5. Properties and complexity
6. Presentation of research projects: Claus Bahlmann: recognition of handwriting; Olaf Ronneberger: automatic recognition of pollen; Bernard Haasdonk: tangent distance and SVM

4 Pattern recognition experiment. Stochastic process with unknown joint distribution P(x,y): a learning set of l data points yields the empirical error R_emp(α); a test set of data yields the test error (generalization ability) R_test(α). Sought is a mapping function f(x,α): x → y, characterized by the unknown parameter vector α, which minimizes the expected value of the mapping error: R(α) = E{R_test(α)}.

5 Learning-theoretic approach. Supervised learning. Given: l observations (learning sample) from the pattern recognition experiment with joint distribution P(x,y), with the corresponding class labels (initially restricted to the two-class problem); apart from that no prior knowledge exists: {(x_i, y_i) ~ P(x,y)}, i = 1,…,l, with x_i ∈ R^N, y_i ∈ {+1, −1}. Sought: a deterministic mapping function f(x,α): x → y, based on the learning sample, which minimizes the expected value of the mapping error on the test data (the expected risk): R(α) = E{R_test(α)} = E{½|y − f(x,α)|} = ∫ ½|y − f(x,α)| dP(x,y) = ∫ ½|y − f(x,α)| p(x,y) dx dy, if the joint distribution is known. Problem: this expression cannot be evaluated, since P(x,y) is not available, and it is therefore not very useful!

6 Central problem of statistical learning theory: when does a small learning error lead to a small real error? The empirical risk, i.e. the error rate on a given training set, can easily be computed as R_emp(α) = (1/(2l)) Σ_{i=1..l} |y_i − f(x_i,α)|. The approach with a neural net and e.g. backpropagation learning is content with Empirical Risk Minimization (ERM). The question: how close are we to the real error after l training examples? And: how well can we estimate the real risk from the empirical risk (Structural Risk Minimization (SRM) instead of Empirical Risk Minimization (ERM))? This is what generalization ability is about! The learning theory of Vapnik and Chervonenkis tries to answer this question!
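As a small illustration (added here, not part of the lecture; plain Python/NumPy assumed), the empirical risk above is just the misclassification rate when the labels are coded as ±1:

```python
import numpy as np

def empirical_risk(y, y_pred):
    """Empirical risk R_emp = 1/(2l) * sum_i |y_i - f(x_i)| for labels in {-1, +1}.
    Each misclassification contributes |y - f| = 2, so this equals the
    ordinary misclassification rate."""
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    return np.abs(y - y_pred).sum() / (2 * len(y))

# toy check: one error out of four samples -> R_emp = 0.25
print(empirical_risk([+1, -1, +1, -1], [+1, -1, -1, -1]))
```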

7 An upper bound on the generalization ability of learning from VC theory (Vapnik/Chervonenkis). With probability (1 − η) the following estimate for the effective error applies (e.g. with η = 0.05, i.e. with a probability of 95%):

R(α) ≤ R_emp(α) + Φ(h, l, η)   (Φ is the VC confidence)

with  Φ(h, l, η) = sqrt( [ h (ln(2l/h) + 1) − ln(η/4) ] / l )

where l is the number of training examples and h is the VC dimension of the hypothesis space used. Design target: keep the VC confidence as small as possible and thus the estimate as tight as possible! Remarkably, this expression is independent of the underlying joint distribution P(x,y) (provided the training and test data are statistically independent). I.e., if there are several learning machines to choose from (each with its own family of mapping functions f(x,α)), the machine that yields the lowest value of the confidence Φ for a given empirical error is to be chosen.
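A small sketch of how this confidence term behaves numerically (an illustration added here, assuming the form of Φ given above with natural logarithms):

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """Phi(h, l, eta) = sqrt((h * (ln(2l/h) + 1) - ln(eta/4)) / l)."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

# the bound loosens as the VC dimension h grows and tightens with more data l
for h in (10, 100, 1000):
    print(h, vc_confidence(h, l=10_000))
```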

8 The VC dimension is a property of a given function class {f(α)}. A given set of l points can be divided into 2^l possible labellings for the two-class problem. The VC dimension of a set of functions {f(α)} is defined as the maximal number of training points that can be separated by this class of functions in all possible constellations. It provides a measure of the capacity of a function class, e.g. the set of linear separating functions, the set of artificial NNs (MLPs), the set of polynomial functions, the set of radial basis functions. The VC confidence grows monotonically with h: accordingly, among learning machines whose empirical risk is 0, the one whose associated set of mapping functions has the minimal VC dimension is to be chosen.

9 VC dimension h of hyperplanes in R². Three points in R² can be separated with hyperplanes (lines) in all constellations. Four points in R² can no longer be separated with hyperplanes in all constellations; but e.g. with polynomials! => hyperplanes in R²: h = 3. Generally: hyperplanes in R^N: h = N+1. The VC dimension of polynomials increases with their degree!

10 VC dimension h of hyperplanes in R³. Four points in R³ can be separated with hyperplanes (planes) in all constellations. Five points in R³ can no longer be separated with hyperplanes in all constellations (e.g. the red points in one class); but e.g. with polynomials! => hyperplanes in R³: h = 4. Generally: hyperplanes in R^N: h = N+1.

11 VC dimension of function classes. Statements about the VC dimension of function classes are possible! VC dimension of the set of hyperplanes with margin δ in R^N: h ≤ min(⌈R²/δ²⌉, N) + 1, where R is the radius of the sphere in which all data points lie and δ is the margin between the hyperplanes. VC dimension of the set of polynomial kernels of degree p, K(x_i, x_j) = (⟨x_i, x_j⟩)^p: h = C(N+p, p) + 1. Thus Structural Risk Minimization becomes possible.
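These capacity formulas are easy to evaluate; a brief sketch (an illustration added here, using the two expressions exactly as stated on this slide):

```python
from math import ceil, comb

def margin_hyperplane_vc_bound(R, delta, N):
    """Upper bound h <= min(ceil(R^2/delta^2), N) + 1 for margin-delta hyperplanes."""
    return min(ceil(R**2 / delta**2), N) + 1

def poly_kernel_vc_dim(N, p):
    """VC dimension C(N+p, p) + 1 of hyperplanes in the feature space of a
    degree-p polynomial kernel over R^N (as stated on the slide)."""
    return comb(N + p, p) + 1

print(margin_hyperplane_vc_bound(R=1.0, delta=0.25, N=256))  # a large margin caps h well below N+1
print(poly_kernel_vc_dim(N=256, p=3))                         # grows rapidly with the degree p
```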

12 Structural Risk Minimization (SRM). SRM means minimizing the estimate for R(α) over nested function classes S_i. These form hypothesis spaces with increasing VC dimension VCdim = h_i (h_1 < h_2 < h_3 < ...). Compromise between empirical risk and generalization ability. [Figure: the bound on R(α) over the nested classes S_1 ⊂ S_2 ⊂ S_3 ⊂ S_4 as a function of VCdim h; the optimum lies where R_emp + Φ is minimal.] Φ increases with h; R_emp decreases with h (overfitting).

13 Notes for the design. There are several expressions for the same phenomenon: bias/variance compromise; generalization/overfitting compromise; empirical error / VC-dimension (capacity control) compromise.

14 Bias-variance tradeoff. [Figure: data {y_i, x_i} fitted by a simple model ŷ = f_1(x) and by a complex model ŷ = f_2(x).] f_1: bias (1/n)Σ_i (y_i − f_1(x_i)) small, variance (1/n)Σ_i (y_i − f_1(x_i))² big => good generalization (simple model)! f_2: bias (1/n)Σ_i (y_i − f_2(x_i)) big, variance (1/n)Σ_i (y_i − f_2(x_i))² small => bad generalization (overfitting)!

15 Linearly separable classes. Data in R². General hyperplane ⟨w,x⟩ + b = 0; classification via f(x) = sign(⟨w,x⟩ + b). E.g. Rosenblatt's perceptron (1956): iterative learning, correction after every misclassification -> the solution is not unique.

16 Optimal hyperplane: maximizing the margin. Considering a linearly separable two-class problem, all lines that separate both classes reach an empirical error of zero. The confidence is minimized by the polynomial of minimal VC dimension, i.e. a hyperplane. The VC dimension can be decreased further by large-margin hyperplanes: the separating hyperplane with maximal margin is the separating hyperplane with minimal VC dimension. This result is plausible: with constant intra-class dispersion the classification certainty increases with increasing inter-class distance; or: for constant inter-class distance, the admissible intra-class dispersion becomes maximal.

17 Large-margin classifier: the hyperplane with the largest margin [Vap63]. [Figure: nested hypothesis classes S_1 ⊂ S_2 ⊂ S_3 with h_1 < h_2 < h_3. S_3: all hyperplanes in R^N, h_3 = N+1 (high VC dimension); S_2: hyperplanes with margin at least δ (medium VC dimension, the variability gets smaller); S_1: the hyperplane with maximal margin 2δ (smallest VC dimension, the variability equals zero).] A clearly reasonable, theoretically founded solution that depends on only a few data points: => the support vectors.

18 VC dimension of broad hyperplanes. [Figure: data inside a sphere of radius R, separated once with a large margin δ_1 and once with a small margin δ_2.] The VC dimension h of hyperplanes that keep a minimal distance δ to the points to be separated is bounded by h ≤ R²/δ² + 1. Since δ_1 > δ_2, we have h_2 > h_1 => big margin δ, small VCdim h.

19 Formalization: sought is the separating hyperplane with maximum margin δ. [Figure: separating hyperplane ⟨w,x⟩ + b = 0 with normal vector w, and the canonical hyperplanes ⟨w,x⟩ + b = −1 and ⟨w,x⟩ + b = +1 through the points x_1, x_2 of the two classes that lie closest to the separating plane.] The data are classified correctly if y_i(⟨w,x_i⟩ + b) > 0. The distance of a point to the separating plane is d(w,b; x) = |⟨w,x⟩ + b| / ||w||. Note: the expression is invariant to a positive re-scaling, y_i(⟨aw,x_i⟩ + ab) > 0; a is chosen such that the canonical hyperplanes satisfy ⟨w,x_1⟩ + b = −1 (margin for the blue class) and ⟨w,x_2⟩ + b = +1 (margin for the red class); x_1, x_2 are support vectors. The distance between the canonical hyperplanes results from projecting x_2 − x_1 onto the normal direction w/||w||: 2δ = ⟨w/||w||, x_2 − x_1⟩ = (⟨w,x_2⟩ − ⟨w,x_1⟩)/||w|| = 2/||w||, hence δ = 1/||w||. Maximizing δ is therefore equivalent to minimizing ||w||².
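A tiny numeric sketch of these quantities (an illustration added here; the data and the hyperplane are made up for the example):

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """min_i y_i * (<w, x_i> + b) / ||w||; positive only if all points are
    classified correctly by the hyperplane <w, x> + b = 0."""
    w = np.asarray(w, dtype=float)
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# two points per class, separable by the line x1 + x2 = 0
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([+1, +1, -1, -1])
print(geometric_margin(w=[1.0, 1.0], b=0.0, X=X, y=y))   # ~= sqrt(2) here
```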

20 Optimization problem. Primal OP: minimize J(w,b) = ½||w||² = ½⟨w,w⟩ under the side conditions y_i(⟨w,x_i⟩ + b) ≥ 1. Introducing a Lagrange function: L(w,b,α) = ½||w||² − Σ_{i=1..l} α_i [y_i(⟨w,x_i⟩ + b) − 1], with α_i ≥ 0. Setting the partial derivatives with respect to w and b to zero and inserting into the primal OP leads to the equivalent Wolfe-dual OP: maximize L_D(α) = Σ_{i=1..l} α_i − ½ Σ_{i=1..l} Σ_{j=1..l} y_i y_j α_i α_j ⟨x_i, x_j⟩ under the constraints 0 ≤ α_i ≤ C (C = ∞ in case of no "slack") and Σ_{i=1..l} y_i α_i = 0. This is a positive semi-definite problem that can be solved numerically and iteratively using convex quadratic programming!

21 Solution: the solution of the dual problem uniquely yields the desired hyperplane (the equations also apply to the case with slack variables!):
w* = Σ_{i=1..l} α_i y_i x_i = Σ_{x_i ∈ SV} α_i y_i x_i
b* = −½ [ max_{x_i ∈ SV, y_i = −1} ⟨w*, x_i⟩ + min_{x_i ∈ SV, y_i = +1} ⟨w*, x_i⟩ ]
Classification: f(x) = sign(⟨w*, x⟩ + b*) = sign( Σ_{x_i ∈ SV} α_i y_i ⟨x_i, x⟩ + b* ).
Observations: for support vectors α_i > 0 holds; for all non-support vectors α_i = 0 -> the support vectors give a sparse representation of the solution. Uniqueness of the plane, global optimum!! The solution depends only on the support vectors!!
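As an illustration of the Wolfe dual (not the lecture's own implementation), the problem can be handed to any generic constrained optimizer; a minimal sketch with SciPy's SLSQP on a made-up toy data set:

```python
# maximize sum(a) - 1/2 * sum_ij y_i y_j a_i a_j <x_i, x_j>
# s.t. a_i >= 0 and sum_i y_i a_i = 0  (hard margin)
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
G = (y[:, None] * y[None, :]) * (X @ X.T)          # y_i y_j <x_i, x_j>

neg_dual = lambda a: 0.5 * a @ G @ a - a.sum()     # minimize the negative dual
cons = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i y_i a_i = 0
res = minimize(neg_dual, np.zeros(len(y)), bounds=[(0.0, None)] * len(y),
               constraints=cons)

a = res.x
sv = a > 1e-6                                      # support vectors: a_i > 0
w = (a[sv] * y[sv]) @ X[sv]                        # w* = sum_SV a_i y_i x_i
b = -0.5 * (max(X[sv & (y < 0)] @ w) + min(X[sv & (y > 0)] @ w))
print(w, b, np.sign(X @ w + b))                    # signs reproduce the labels
```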

22 The gradient is a normal vector to the curve g(x) = 0. Considering the Taylor expansion of g for a small vectorial perturbation ε at a point x on the curve g(x) = 0 gives: g(x + ε) = g(x) + ε^T ∇g + ... If the perturbation ε moves along the curve, then g(x + ε) = g(x) and thus ε^T ∇g(x) = 0. From this we see that the gradient is orthogonal to the surface g(x) = 0 (it is a normal vector).

23 Constraint condition for a stationary point x*. [Figure: contour lines f(x) = c and the constraint curve g(x) = 0; (a) relations at a non-stationary point, (b) relations at a stationary point x*.] (a) If the projection of ∇f onto the tangential direction t of the constraint curve contains a contribution other than 0, the optimization criterion can be improved by moving along the constraint curve in this direction => not a stationary point! (b) At a stationary point the objective f(x) worsens in both tangential directions; ∇f is orthogonal to t.

24 Example. Objective: f(x) = x_1² + x_2² (contour lines f(x) = c); side condition (NB): g(x) = x_1 + x_2 + 1 = 0. Gradients: ∇f = (2x_1, 2x_2)^T, ∇g = (1, 1)^T. Lagrange function: L(x, λ) = f(x) + λ g(x). Necessary conditions for an optimum:
1) ∇_x L = ∇f + λ∇g = 0, i.e. 2x_1 + λ = 0 and 2x_2 + λ = 0;
2) ∂L/∂λ = g(x) = 0, i.e. x_1 + x_2 + 1 = 0.
Solution: λ = 1 and x_1 = x_2 = −1/2, with f(x*) = 1/2; at the optimum ∇f(x*) = −λ∇g(x*).
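A quick symbolic cross-check of this worked example (an illustration added here using SymPy, not part of the lecture):

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda')
f = x1**2 + x2**2          # objective
g = x1 + x2 + 1            # side condition g(x) = 0
L = f + lam * g            # Lagrange function

sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), sp.diff(L, lam)], [x1, x2, lam])
print(sol)                 # {x1: -1/2, x2: -1/2, lambda: 1}
print(f.subs(sol))         # 1/2
```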

25 Demos with MATLAB (svmmatlab, uclass.m). Linear classifier: hard margin, separable problem; hard but small margin due to an outlier: problem separable, but bad generalization => accept an increased empirical error in exchange for gains in generalization => introduce a soft margin. Non-linear classifier: for non-linear problems the VC dimension of the separating hyperplanes, and thus their capacity, has to be increased! Linear separation of a quadratic problem with a soft margin; polynomial separation (p=2), hard margin; polynomial separation (p=4), hard margin; banana shape with polynomial kernels; banana shape with Gaussian radial basis functions. Start of the Matlab demo: matlab-svm.bat

26 Linearly separable classes; hard margin

27 Linearly separable classes; hard margin, small margin, bad generalization

28 Non-separable case. [Figure: hyperplanes ⟨w,x⟩ + b = −1, 0, +1 and a point x_i violating the margin by ξ_i.] Penalizing margin violations with "slack" variables ξ_i [Smith68] -> Soft-Margin SVM: minimize ½||w||² + C Σ_{i=1..l} ξ_i with y_i(⟨w,x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0. Reasons for this kind of generalization: there may be no solution with the previous hard-margin approach; the generalization over outliers in the margin can be improved at the cost of a slightly increased empirical error. Now we have to distinguish three cases: 0 < α_i < C: SVs with ξ_i = 0 (they lie on the canonical hyperplanes); α_i = C: further SVs with ξ_i > 0 (margin violations, possibly falsely assigned); α_i = 0: all remaining vectors x_i. C → ∞ corresponds to the separable approach with a hard margin!
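The effect of the penalty parameter C can be seen directly in practice; a small sketch using scikit-learn's SVC (an illustration added here, not the lecture's MATLAB demo):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(-2, 0), size=(50, 2)),
               rng.normal(loc=(+2, 0), size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# small C tolerates margin violations (broad, soft margin);
# large C approaches the hard-margin solution
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```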

29 Linearly separable classes; soft, broad margin => R_emp greater, good generalization

30 Linearly separable classes; soft, broad margin => R_emp greater, good generalization

31 Linearly separable classes; hard margin

32 Linearly separable classes; hard margin

33 Linearly separable classes; hard margin

34 Linearly separable classes; hard margin

35 Global, unique optimum. Optimization of a quadratic problem in w (convex) under linear constraints yields a global optimum. Admissible solution regions are convex sets; intersecting them with further constraints again yields convex sets => when intersecting these sets with the original problem, the properties of a global optimum are preserved, even if this optimum lies on the boundary! [Figure: feasible region in the (α_1, α_2) plane: the box α_1 ∈ [0,C], α_2 ∈ [0,C] intersected with the line Σ_i α_i y_i = 0.]

36 Example: designing an SVM for 2 points. x_1 = (1, 0)^T with y_1 = +1 and x_2 = (0, 1)^T with y_2 = −1; classification via f(x) = sign(⟨w,x⟩ + b). Both vectors are support vectors!
1) maximize L_D(α) = Σ_i α_i − ½ Σ_i Σ_j y_i y_j α_i α_j ⟨x_i, x_j⟩ = (α_1 + α_2) − ½(α_1² + α_2²)
2) under the side conditions 0 ≤ α_i < ∞
3) and Σ_i y_i α_i = (α_1 − α_2) = 0, i.e. α_1 = α_2.
Solution: α_1 = α_2 = 1;
w* = Σ_{x_i ∈ SV} α_i y_i x_i = x_1 − x_2 = (1, −1)^T;
b* = −½ [ max_{x_i ∈ SV, y_i=−1} ⟨w*, x_i⟩ + min_{x_i ∈ SV, y_i=+1} ⟨w*, x_i⟩ ] = −½(−1 + 1) = 0;
classification: f(x) = sign(⟨w*, x⟩).
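A quick numeric cross-check of this two-point example (an illustration added here with scikit-learn; a very large C is used to approximate the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, -1])
clf = SVC(kernel='linear', C=1e6).fit(X, y)
print(clf.coef_, clf.intercept_)   # roughly w = (1, -1), b = 0
print(clf.dual_coef_)              # alpha_i * y_i for the SVs, entries of magnitude 1
```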

37 Example: designing an SVM for 3 points. x_1 = (0, 0)^T with y_1 = −1, x_2 = (0, 1)^T with y_2 = +1, x_3 = (1, λ)^T with y_3 = −1. Case I: for λ ≥ 0 there are three support vectors; case II: for other values only two support vectors.
1) maximize L_D(α) = Σ_i α_i − ½ Σ_i Σ_j y_i y_j α_i α_j ⟨x_i, x_j⟩ = (α_1 + α_2 + α_3) − ½(α_2² − 2λ α_2 α_3 + (1 + λ²) α_3²)
2) under the side conditions 0 ≤ α_i < ∞
3) and Σ_i y_i α_i = (−α_1 + α_2 − α_3) = 0, i.e. α_1 = α_2 − α_3.
Inserting 3) into 1) results in: L_D(α) = 2α_2 − ½(α_2² − 2λ α_2 α_3 + (1 + λ²) α_3²).
Solution: α_1 = 2(1 − λ + λ²), α_2 = 2(1 + λ²), α_3 = 2λ; for λ ≥ 0 all α_i ≥ 0 (case I), and then:
w* = Σ_{x_i ∈ SV} α_i y_i x_i = (−2λ, 2)^T
b* = −½ [ max_{x_i ∈ SV, y_i=−1} ⟨w*, x_i⟩ + min_{x_i ∈ SV, y_i=+1} ⟨w*, x_i⟩ ] = −½(0 + 2) = −1.

38 Case I: λ = 0.5: the absolute maximum lies within the region α_i ≥ 0. Case II: λ = −1: the absolute maximum lies outside of the region α_i ≥ 0; the solution lies on the boundary of the admissible region.

39 Non-linear problems: some problems have non-linear class boundaries; hyperplanes do not reach a satisfying accuracy.

40 Expanding the hypothesis space. Idea: find a hyperplane in a higher-dimensional feature space. Original space X := R^N with dim X = N; map Ψ: x → z into the feature space H with dim H = M >> N. The separating hyperplane in the feature space is a non-linear separating surface in the original space (cf. the XOR problem with the polynomial classifier).

41 Non-linear problems. One-dimensional original space: x. Two-dimensional feature space: Ψ(x) = z = (z_1, z_2)^T = (x, x²)^T. [Figure: in the original space x the two classes are not linearly separable; in the feature space (z_1, z_2) a linear separation is possible.]

42 Non-linear problems. Original space: x = (x_1, x_2) (two-dimensional). Feature space: Ψ(x) = z = (z_1, …, z_6)^T = (x_1², x_2², √2 x_1, √2 x_2, √2 x_1x_2, 1)^T. [Figure: the non-linear separation x_1² ≷ x_2 in the original space becomes the linear separation z_1 ≷ z_4/√2 in the feature space.]

43 Non-linear expansion: pre-compute the non-linear map Ψ(x): R^N → H, x_i → Ψ(x_i). Example XOR problem: x = (x_1, x_2)^T, Ψ(x) = (x_1², √2 x_1x_2, x_2²)^T; the cross term x_1x_2 makes the XOR configuration linearly separable.

44 Consequences. Effect: increased separability; the separating surface in the original space is non-linear. Questions: 1. Is the hyperplane still optimal? 2. Is the complexity in high-dimensional spaces high? Ad 1: optimality is preserved; the problem is again positive semi-definite, since the same scalar products occur in the function to be optimized, only in a new space H. Ad 2: the high complexity in the high-dimensional feature space H can be reduced using the kernel trick: the inner product in H has an equivalent formulation via a kernel function in the original space X.

45 The kernel trick. Problem: very high dimension of the feature space! Polynomials of p-th degree over the N-dimensional original space lead to O(M = N^p) dimensions of the feature space! Solution: in the dual OP only scalar products ⟨x_i, x_j⟩ occur. In the corresponding problem in the feature space, likewise only scalar products ⟨Ψ(x_i), Ψ(x_j)⟩ occur. They do not need to be computed explicitly, but can be expressed with reduced complexity using kernel functions in the original space X: K(x_i, x_j) = ⟨Ψ(x_i), Ψ(x_j)⟩ = ⟨z_i, z_j⟩. Example: for Ψ(x) = z = (x_1², x_2², √2 x_1, √2 x_2, √2 x_1x_2, 1)^T, the kernel K(x_i, x_j) = (⟨x_i, x_j⟩ + 1)² = ⟨Ψ(x_i), Ψ(x_j)⟩ computes the scalar product in the feature space.
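This identity is easy to verify numerically; a short sketch (an illustration added here, with psi following the feature map given on this slide):

```python
import numpy as np

def psi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, 1.0])

def poly_kernel(a, b):
    return (np.dot(a, b) + 1.0) ** 2

a, b = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(np.dot(psi(a), psi(b)))   # explicit scalar product in the feature space
print(poly_kernel(a, b))        # same value, computed in the original space
```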

46 Frequently used kernel functions: polynomial kernel K(x_i, x_j) = (⟨x_i, x_j⟩ + 1)^p; Gaussian kernel K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)); sigmoid kernel K(x_i, x_j) = tanh(κ⟨x_i, x_j⟩ + θ). The resulting classifiers are comparable to polynomial classifiers, radial basis functions and neural networks (however they are motivated differently). General requirement: Mercer's condition guarantees that a given kernel function is in fact a scalar product in some space, but it does not explain what the corresponding map Ψ and the space H look like. Also: linear combinations of valid kernels yield new kernels (e.g. the sum of two positive definite functions is again positive definite).

47 Mercer's theorem. There exists a map Ψ and an expansion K(x, x_j) = ⟨Ψ(x), Ψ(x_j)⟩ exactly when, for every g(x) with ∫ g(x)² dx < ∞, K is a symmetric, positive semi-definite function in x, i.e. ∫∫ K(x, x_j) g(x) g(x_j) dx dx_j ≥ 0. Still, in some cases kernel functions that do not meet the Mercer condition nevertheless lead to a positive semi-definite Hessian matrix for a particular set of training data and thus converge to a global optimum.

48 Final formulation. Training: maximize L_D(α) = Σ_{i=1..l} α_i − ½ Σ_{i=1..l} Σ_{j=1..l} y_i y_j α_i α_j K(x_i, x_j) with Σ_{i=1..l} y_i α_i = 0 and 0 ≤ α_i ≤ C (hard margin: C = ∞).
w* = Σ_{i=1..l} α_i y_i Ψ(x_i) = Σ_{x_i ∈ SV} α_i y_i Ψ(x_i)
b* = −½ [ max_{x_i ∈ SV, y_i=−1} Σ_{x_j ∈ SV} y_j α_j K(x_i, x_j) + min_{x_i ∈ SV, y_i=+1} Σ_{x_j ∈ SV} y_j α_j K(x_i, x_j) ]
Classifying an unknown object x (α_i ≠ 0 for all SVs):
f(x) = sign(⟨w*, Ψ(x)⟩ + b*) = sign( Σ_{x_i ∈ SV} α_i y_i ⟨Ψ(x_i), Ψ(x)⟩ + b* ) = sign( Σ_{x_i ∈ SV} α_i y_i K(x_i, x) + b* )
(the explicit form with w* and Ψ is not computed in practice!)

49 Linear separation of a quadratic problem; soft margin

50 Polynomial separation of a quadratic problem; hard margin

51 Polynomial separation of a quadratic problem; hard margin

52 Non-linearly separable classes; hard margin

53 Non-linearly separable classes; hard margin

54 Example: isles, polynomial with p = 2

55 Example: isles, polynomial with p = 3

56 Example: isles, polynomial with p = 4

57 Example: isles, RBFs with σ = 1

59 Application example: sign recognition. Example: US Postal Service digits [Schö95/96]; 16x16 grey-scale images; 7291 training and 2007 test samples. Results:
classifier                    error rate
human                         2.5%
2-layer NN                    5.9%
5-layer NN                    5.1%
SVM (polynomial, degree 3)    4.0%
SVM + invariance              3.2%
SVM + invariance: quintuple the original data set by shifting the grey-scale patterns in the 4 main directions (N, S, E, W) by one pixel.

60 SVM with RBF kernel: VC bound compared to the actual test error rate (only one free parameter: σ). [Figure: the analytic VC bound and the test error R_test(σ) as functions of σ.] Unfortunately the bounds are rough approximations, but they are qualitatively significant (the minimum lies at the same location). Alternative: seek σ by minimizing the test error through cross-validation. Forming the expected value over all leave-one-out experiments yields an estimate of the real risk. The cost of a leave-one-out experiment can be reduced to checking the SVs, since only the SVs have an influence on the measured error rate.
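Model selection of the kernel width by cross-validation can be sketched as follows (an illustration added here with scikit-learn, whose RBF parameter gamma corresponds to 1/(2σ²); k-fold CV is used instead of leave-one-out):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
grid = GridSearchCV(SVC(kernel='rbf', C=1.0),
                    param_grid={'gamma': np.logspace(-2, 2, 9)},
                    cv=5)               # 5-fold cross-validation
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```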

61 SVM: properties and complexity. Advantages: according to the current understanding the SVM provides very good results and (under certain conditions) finds a global optimum (an NN normally finds only suboptimal solutions). Sparse description of the solution via the N_s support vectors. Easily applicable (few parameters (C, σ), no a-priori knowledge needed, no design); however, the exact choice of (C, σ) is one of the biggest disadvantages in the design. Geometrically visualizable functionality. Theoretical statements about the result: global optimum, generalization ability. SRM is possible, but difficult to manage. Problems with big feature spaces can also be solved (large numbers of training and testing patterns). If the problem is only semi-definite: the dual problem has many solutions, but all of them lead to the same hyperplane!

62 Disadvantages: the VC estimate is very imprecise and of limited use in practice. The multi-class approach is still a matter of research; one approach: one SVM per class (or per pair of classes), i.e. cost and space complexity increase with the number of classes. [Figure: reduction of a multi-class problem to a series of binary decision problems with the pairwise classifiers 1/2, 1/3 and 2/3.]
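Both standard reductions to binary problems are a one-liner in current libraries; a brief sketch (an illustration added here with scikit-learn, not part of the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                            # 3 classes
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)    # one SVM per class
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)     # one SVM per pair of classes
print(len(ovr.estimators_), len(ovo.estimators_))            # 3 and 3*(3-1)/2 = 3
```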

63 Disadvantages: no quantitative quality statement about a classification. Slow, space-intensive learning: runtime O(N_s³) (inversion of the Hessian matrix / Newton algorithm), space O(N_s²) (in practice, however, less than quadratic). Approach for improvement: divide into partial problems. Slow classification with O(M·N_s), with M = dim(H) (i.e. in the case without kernel functions M = N = dim(x_i)); but for appropriately chosen kernel functions, effectively M = N applies as well.

64 Literature:
(1) C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Knowledge Discovery and Data Mining, 2(2), 1998.
(2) V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
(3) N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
(4) B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
(5) S.R. Gunn, Support Vector Machines for Classification and Regression, Technical Report, Department of Electronics and Computer Science, University of Southampton, 1998.

More information