Natural Language Processing and Information Retrieval: Support Vector Machines. Alessandro Moschitti, Department of Information and Communication Technology, University of Trento. Email: moschitti@disi.unitn.it
Summary: Support Vector Machines; Hard-margin SVMs; Soft-margin SVMs
Which hyperplane should we choose?
Classifier with a Maximum Margin. IDEA 1: Select the hyperplane with the maximum margin. [Figure: two classes (Var 1, Var 2) separated by candidate hyperplanes, with the margin of each shown.]
Support Vectors. [Figure: the two classes (Var 1, Var 2) with the maximum-margin hyperplane; the points lying on the margin are the support vectors.]
Support Vector Machine Classifiers. The margin is equal to $\frac{2k}{\|\vec{w}\|}$, where the two margin hyperplanes are $\vec{w}\cdot\vec{x} + b = k$ and $\vec{w}\cdot\vec{x} + b = -k$, and the separating hyperplane is $\vec{w}\cdot\vec{x} + b = 0$. [Figure: the two classes Var 1 and Var 2 with the three hyperplanes.]
Support Vector Machines. The margin is equal to $\frac{2k}{\|\vec{w}\|}$. We need to solve: $\max \frac{2k}{\|\vec{w}\|}$ subject to $\vec{w}\cdot\vec{x}_i + b \geq +k$ if $\vec{x}_i$ is positive, $\vec{w}\cdot\vec{x}_i + b \leq -k$ if $\vec{x}_i$ is negative.
Support Vector Machines. There is a scale for which $k = 1$, so the problem becomes: $\max \frac{2}{\|\vec{w}\|}$ subject to $\vec{w}\cdot\vec{x}_i + b \geq +1$ if $\vec{x}_i$ is positive, $\vec{w}\cdot\vec{x}_i + b \leq -1$ if $\vec{x}_i$ is negative.
Final Formulation.
$\max \frac{2}{\|\vec{w}\|}$ s.t. $\vec{w}\cdot\vec{x}_i + b \geq +1$ if $y_i = 1$, $\vec{w}\cdot\vec{x}_i + b \leq -1$ if $y_i = -1$
$\Rightarrow\; \max \frac{2}{\|\vec{w}\|}$ s.t. $y_i(\vec{w}\cdot\vec{x}_i + b) \geq 1$
$\Rightarrow\; \min \frac{\|\vec{w}\|}{2}$ s.t. $y_i(\vec{w}\cdot\vec{x}_i + b) \geq 1$
$\Rightarrow\; \min \frac{\|\vec{w}\|^2}{2}$ s.t. $y_i(\vec{w}\cdot\vec{x}_i + b) \geq 1$
Optimization Problem. Optimal Hyperplane: minimize $\Phi(\vec{w}) = \frac{1}{2}\|\vec{w}\|^2$ subject to $y_i(\vec{w}\cdot\vec{x}_i + b) \geq 1,\; i = 1,\ldots,m$. The dual problem is simpler.
Lagrangian Definition
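In the usual formulation, the Lagrangian associated with the primal problem above is
$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i(\vec{w}\cdot\vec{x}_i + b) - 1 \right]$,
with one Lagrange multiplier $\alpha_i \geq 0$ per constraint.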
Dual Optimization Problem
Dual Transformation. Given the Lagrangian associated with our problem, to solve the dual problem we need to evaluate it at its stationary point: let us set the derivatives to 0, first with respect to $\vec{w}$.
Dual Transformation (cont'd): and with respect to $b$. Then we substitute the results back into the Lagrange function.
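Spelling out the usual algebra, the two derivative conditions are
$\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i$, and
$\frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$.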
Final Dual Problem
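Substituting those conditions into the Lagrangian yields, in the standard presentation, the dual problem:
maximize $\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, \vec{x}_i\cdot\vec{x}_j$
subject to $\alpha_i \geq 0$, $i = 1,\ldots,m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$.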
Karush-Kuhn-Tucker Theorem. Necessary and sufficient conditions for optimality:
$\frac{\partial L(\vec{w}, \alpha, \beta)}{\partial \vec{w}} = 0$
$\frac{\partial L(\vec{w}, \alpha, \beta)}{\partial b} = 0$
$\alpha_i \, g_i(\vec{w}) = 0,\; i = 1,\ldots,m$
$g_i(\vec{w}) \leq 0,\; i = 1,\ldots,m$
$\alpha_i \geq 0,\; i = 1,\ldots,m$
Properties coming from the constraints.
Lagrange constraints: $\sum_{i=1}^{m} \alpha_i y_i = 0$ and $\vec{w} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i$.
Karush-Kuhn-Tucker constraints: $\alpha_i \left[ y_i(\vec{x}_i\cdot\vec{w} + b) - 1 \right] = 0,\; i = 1,\ldots,m$.
Support Vectors have non-null $\alpha_i$.
To evaluate $b$, we can apply the following equation (from any support vector).
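Concretely, for any support vector $\vec{x}_i$ (i.e. $\alpha_i > 0$) the KKT condition gives $y_i(\vec{x}_i\cdot\vec{w} + b) = 1$, and since $y_i \in \{-1, +1\}$,
$b = y_i - \vec{w}\cdot\vec{x}_i = y_i - \sum_{j=1}^{m} \alpha_j y_j \, \vec{x}_j\cdot\vec{x}_i$.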
Warning! In the graphical examples, we always consider normalized hyperplanes (hyperplanes with normalized gradient): in this case $b$ is exactly the distance of the hyperplane from the origin. So for a non-normalized equation, e.g. $\vec{x}\cdot\vec{w}' + b = 0$ with $\vec{x} = (x, y)$ and $\vec{w}' = (1, 1)$, $b$ is not the distance.
Warning! Let us consider a normalized gradient $\vec{w} = (1/\sqrt{2}, 1/\sqrt{2})$:
$(x, y)\cdot(1/\sqrt{2}, 1/\sqrt{2}) + b = 0 \;\Rightarrow\; x/\sqrt{2} + y/\sqrt{2} = -b \;\Rightarrow\; y = -x - b\sqrt{2}$.
Now we see that $-b$ is exactly the distance: for $x = 0$ the intersection with the $y$-axis is $-b\sqrt{2}$, and this distance projected onto $\vec{w}$ is $-b$.
Soft Margin SVMs. Slack variables $\xi_i$ are added: some errors are allowed, but they should penalize the objective function. [Figure: the two classes Var 1 and Var 2 with the hyperplanes $\vec{w}\cdot\vec{x} + b = 1$, $\vec{w}\cdot\vec{x} + b = -1$ and $\vec{w}\cdot\vec{x} + b = 0$.]
Soft Margin SVMs. The new constraints are $y_i(\vec{w}\cdot\vec{x}_i + b) \geq 1 - \xi_i$, $\forall \vec{x}_i$, where $\xi_i \geq 0$. The objective function penalizes the incorrectly classified examples: $\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i$, where $C$ is the trade-off between the margin and the error. [Figure: as before, with a misclassified example at distance $\xi_i$ inside the margin.]
Dual formulation.
$\min \frac{1}{2}\|\vec{w}\|^2 + \frac{C}{2}\sum_{i=1}^{m} \xi_i^2$ subject to $y_i(\vec{w}\cdot\vec{x}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $i = 1,\ldots,m$.
$L(\vec{w}, b, \vec{\xi}, \vec{\alpha}) = \frac{1}{2}\vec{w}\cdot\vec{w} + \frac{C}{2}\sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i(\vec{w}\cdot\vec{x}_i + b) - 1 + \xi_i \right]$
By deriving with respect to $\vec{w}$, $\vec{\xi}$ and $b$:
Partial Derivatives
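Assuming the $\frac{C}{2}\sum_i \xi_i^2$ objective of the previous slide, the partial derivatives set to zero give, in the usual derivation,
$\frac{\partial L}{\partial \vec{w}} = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i$, $\quad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$, $\quad \frac{\partial L}{\partial \xi_i} = C\xi_i - \alpha_i = 0 \;\Rightarrow\; \xi_i = \frac{\alpha_i}{C}$.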
Substitution into the objective function ($\delta_{ij}$ is the Kronecker delta).
Final dual optimization problem
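With the substitutions above, the squared-slack soft-margin dual takes the standard form:
maximize $\sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( \vec{x}_i\cdot\vec{x}_j + \frac{\delta_{ij}}{C} \right)$
subject to $\alpha_i \geq 0$, $i = 1,\ldots,m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$.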
Soft Margin Support Vector Machines. The algorithm tries to keep the $\xi_i$ low and to maximize the margin. NB: the number of errors is not directly minimized (that would be an NP-complete problem); instead, the distances from the hyperplane are minimized. If $C \to \infty$, the solution tends to the one of the hard-margin algorithm. Attention: if $C = 0$ we get $\vec{w} = 0$, since in $\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i$ with $y_i(\vec{w}\cdot\vec{x}_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$, setting $\vec{w} = 0$ leaves only the constraints $y_i b \geq 1 - \xi_i$, $\forall \vec{x}_i$, which can always be satisfied with large enough $\xi_i$ at no cost. As $C$ increases, the number of errors decreases; when $C$ tends to infinity the number of errors must be 0, i.e. we recover the hard-margin formulation.
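A minimal, illustrative sketch of this trade-off (not part of the original slides; it assumes scikit-learn and NumPy are available, and the toy data set is invented):

# Illustrative only: how the trade-off parameter C affects a soft-margin SVM.
import numpy as np
from sklearn.svm import SVC

# Tiny toy set with one "odd" positive example sitting among the negatives.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -1.5], [-3.0, -2.5], [-2.2, -1.8]])
y = np.array([1, 1, 1, -1, -1, -1, 1])

for C in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    errors = int((clf.predict(X) != y).sum())
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:g}  training errors={errors}  margin={margin:.3f}")
# Small C: wide margin, training errors tolerated.
# Large C: the solution approaches the hard-margin SVM.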
Robustness of Soft vs. Hard Margin SVMs. [Figure: the same data classified by a Soft Margin SVM (left, with slack $\xi_i$ on an outlier) and by a Hard Margin SVM (right); both show the hyperplane $\vec{w}\cdot\vec{x} + b = 0$.]
Soft vs. Hard Margin SVMs. Soft-Margin always has a solution. Soft-Margin is more robust to outliers. Hard-Margin does not require parameters.
Parameters.
$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i \;=\; \min \frac{1}{2}\|\vec{w}\|^2 + C_{+}\sum_i \xi_i^{+} + C_{-}\sum_i \xi_i^{-} \;=\; \min \frac{1}{2}\|\vec{w}\|^2 + C\left( J\sum_i \xi_i^{+} + \sum_i \xi_i^{-} \right)$
C: trade-off parameter; J: cost factor.
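One way to realize the cost factor J in practice (a sketch assuming scikit-learn; the data and the value of J are invented) is to rescale C per class with class_weight, so that slack on positive examples is penalized J times more than slack on negatives:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1, 1, -1, -1])

J = 3.0  # cost factor: errors on positives weigh J times more than errors on negatives
clf = SVC(kernel="linear", C=1.0, class_weight={1: J, -1: 1.0}).fit(X, y)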
Theoretical Justification
Definition of Training Set Error.
Training data: $(\vec{x}_1, y_1), \ldots, (\vec{x}_m, y_m) \in \mathbb{R}^N \times \{\pm 1\}$, and a classifier $f : \mathbb{R}^N \to \{\pm 1\}$.
Empirical Risk (error): $R_{emp}[f] = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left| f(\vec{x}_i) - y_i \right|$.
Risk (error): $R[f] = \int \frac{1}{2}\left| f(\vec{x}) - y \right| \, dP(\vec{x}, y)$.
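A small helper (my own naming, purely for illustration) that computes the empirical risk above for a classifier f returning labels in {-1, +1}:

import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/m) * sum_i |f(x_i) - y_i| / 2, i.e. the fraction of misclassified examples."""
    predictions = np.array([f(x) for x in X])  # each prediction is -1 or +1
    return float(np.mean(np.abs(predictions - y) / 2.0))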
Error Characterization (part 1). From PAC-learning Theory (Vapnik):
$R(\alpha) \leq R_{emp}(\alpha) + \Phi\!\left(\frac{d}{m}, \frac{\log(\delta)}{m}\right)$, with
$\Phi\!\left(\frac{d}{m}, \frac{\log(\delta)}{m}\right) = \sqrt{\frac{d\left(\log\frac{2m}{d} + 1\right) - \log\frac{\delta}{4}}{m}}$,
where $d$ is the VC-dimension, $m$ is the number of examples, $\delta$ is a bound on the probability to get such error, and $\alpha$ is a classifier parameter.
There are many versions for different bounds.
Error Characterization (part 2)
Ranking, Regression and Multiclassification
The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]. The aim is to classify instance pairs as correctly ranked or incorrectly ranked. This turns an ordinal regression problem back into a binary classification problem. We want a ranking function $f$ such that $x_i > x_j$ iff $f(x_i) > f(x_j)$, or at least one that tries to do this with minimal error. Suppose that $f$ is a linear function: $f(x_i) = \vec{w}\cdot\vec{x}_i$.
Sec. 15.4.2 The Ranking SVM. Ranking Model: $f(\vec{x}_i)$.
Sec. 15.4.2 The Ranking SVM. Then (combining the two equations on the last slide):
$x_i > x_j$ iff $\vec{w}\cdot\vec{x}_i - \vec{w}\cdot\vec{x}_j > 0$
$x_i > x_j$ iff $\vec{w}\cdot(\vec{x}_i - \vec{x}_j) > 0$
Let us then create a new instance space from such pairs: $\vec{z}_k = \vec{x}_i - \vec{x}_k$, with $y_k = +1$ or $-1$ as $x_i >$ or $< x_k$.
Support Vector Ranking.
$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_{k} \xi_k^2$ subject to $y_k\left( \vec{w}\cdot(\vec{x}_i - \vec{x}_j) + b \right) \geq 1 - \xi_k$, $\xi_k \geq 0$, with $k = 1,\ldots,m^2$ and $i, j = 1,\ldots,m$,
where $y_k = 1$ if $rank(\vec{x}_i) > rank(\vec{x}_j)$, $-1$ otherwise, and $k = i \times m + j$.
Given two examples, we build one pair example $(\vec{x}_i, \vec{x}_j)$.
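A minimal sketch of this pairwise reduction (assuming scikit-learn; function names and toy data are mine): build the difference vectors z = x_i - x_j with labels given by the rank order, train an ordinary linear SVM on them, and use w.x to order new instances.

import numpy as np
from sklearn.svm import LinearSVC

def make_pairs(X, ranks):
    """Turn ranked instances into pairwise difference examples (z_k, y_k)."""
    Z, y = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if i != j and ranks[i] != ranks[j]:
                Z.append(X[i] - X[j])
                y.append(1 if ranks[i] > ranks[j] else -1)
    return np.array(Z), np.array(y)

X = np.array([[1.0, 0.2], [0.5, 0.1], [0.1, 0.9]])
ranks = np.array([3, 2, 1])               # higher value = higher relevance (toy data)
Z, y = make_pairs(X, ranks)
ranker = LinearSVC(C=1.0).fit(Z, y)       # learns w with w.(x_i - x_j) > 0 when x_i outranks x_j
scores = X @ ranker.coef_[0]              # f(x) = w.x gives the ranking score
print(np.argsort(-scores))                # indices from highest to lowest predicted rank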
Support Vector Regression (SVR). $f(\vec{x}) = \vec{w}\cdot\vec{x} + b$.
Minimize $\frac{1}{2}\vec{w}^T\vec{w}$
Constraints: $y_i - \vec{w}^T\vec{x}_i - b \leq \varepsilon$ and $\vec{w}^T\vec{x}_i + b - y_i \leq \varepsilon$.
[Figure: the regression solution $f(x)$ with the $\varepsilon$-tube bounded by $+\varepsilon$ and $-\varepsilon$.]
Support Vector Regression (SVR). $f(\vec{x}) = \vec{w}\cdot\vec{x} + b$.
Minimize $\frac{1}{2}\vec{w}^T\vec{w} + C\sum_{i=1}^{N}(\xi_i + \xi_i^*)$
Constraints: $y_i - \vec{w}^T\vec{x}_i - b \leq \varepsilon + \xi_i$, $\vec{w}^T\vec{x}_i + b - y_i \leq \varepsilon + \xi_i^*$, $\xi_i, \xi_i^* \geq 0$.
[Figure: the $\varepsilon$-tube with slack variables $\xi$ and $\xi^*$ for points outside it.]
Support Vector Regression. $y_i$ is no longer $-1$ or $1$: it is now a real value. $\varepsilon$ is the tolerance on the value of our function.
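A brief usage sketch (assuming scikit-learn; the synthetic data are mine) where epsilon plays exactly this role of a tolerance tube around the regression function:

import numpy as np
from sklearn.svm import SVR

# Noisy samples of y = 2x + 1 (synthetic data for illustration).
X = np.linspace(0.0, 5.0, 20).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + np.random.default_rng(0).normal(scale=0.1, size=20)

# epsilon: deviations smaller than epsilon are not penalized (the tube in the figure);
# C: trade-off between the flatness of f and deviations larger than epsilon.
reg = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)
print(reg.predict([[2.5]]))   # roughly 6.0 on this toy data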
From Binary to Multiclass classifiers. Three different approaches:
ONE-vs-ALL (OVA): given the example sets {E1, E2, E3, ...} for the categories {C1, C2, C3, ...}, the binary classifiers {b1, b2, b3, ...} are built. For b1, E1 is the set of positives and E2 together with E3 and the rest is the set of negatives, and so on. For testing: given a classification instance x, the category is the one associated with the maximum margin among all binary classifiers (see the sketch below).
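A minimal sketch of this scheme (assuming scikit-learn; helper names and toy data are mine), where the predicted category is the one whose binary classifier returns the largest decision value:

import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, classes):
    """One binary SVM per class: its examples are positives, all the others negatives."""
    return {c: LinearSVC(C=1.0).fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_ova(classifiers, x):
    """Pick the category with the maximum decision value (margin) over all binary SVMs."""
    scores = {c: clf.decision_function([x])[0] for c, clf in classifiers.items()}
    return max(scores, key=scores.get)

X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.2], [-1.0, -1.0], [-0.9, -1.1]])
y = np.array([1, 1, 2, 2, 3, 3])
classifiers = train_ova(X, y, classes=[1, 2, 3])
print(predict_ova(classifiers, np.array([0.1, 1.1])))   # expected: 1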
From Binary to Multiclass classifiers.
ALL-vs-ALL (AVA): given the example sets {E1, E2, E3, ...} for the categories {C1, C2, C3, ...}, build the binary classifiers {b1_2, b1_3, ..., b1_n, b2_3, b2_4, ..., b2_n, ..., bn-1_n} by learning on E1 (positives) and E2 (negatives), on E1 (positives) and E3 (negatives), and so on. For testing: given an example x, the votes of all classifiers are collected, where bE1E2 = 1 means a vote for C1 and bE1E2 = -1 means a vote for C2. Select the category that receives the most votes (see the usage sketch below).
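scikit-learn ships a wrapper that implements this pairwise voting scheme directly; a minimal usage sketch (toy data are mine):

import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.2], [-1.0, -1.0], [-0.9, -1.1]])
y = np.array([1, 1, 2, 2, 3, 3])

# One binary SVM per pair of classes; at test time each classifier votes for one of its two classes.
ava = OneVsOneClassifier(LinearSVC(C=1.0)).fit(X, y)
print(ava.predict([[0.1, 1.1]]))   # expected: [1]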
From Binary to Multiclass classifiers.
Error Correcting Output Codes (ECOC): the training set is partitioned according to binary sequences (codes) associated with category sets. For example, 10101 indicates that the examples of C1, C3 and C5 are used to train the C_10101 classifier; the data of the other categories, i.e. C2 and C4, will be the negative examples. In testing: the code classifiers are used to decode the original class, e.g. C_10101 = 1 and C_11010 = 1 indicate that the instance belongs to C1, the only class consistent with both codes (a library-based sketch follows).
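scikit-learn also provides an ECOC wrapper; note that it draws a random code matrix rather than the hand-built codes of the example above, so this is only a rough illustration of the same idea (toy data are mine):

import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.2], [-1.0, -1.0], [-0.9, -1.1]])
y = np.array([1, 1, 2, 2, 3, 3])

# code_size controls the code length (number of binary problems) relative to the number of classes.
ecoc = OutputCodeClassifier(LinearSVC(C=1.0), code_size=2, random_state=0).fit(X, y)
print(ecoc.predict([[0.1, 1.1]]))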
SVM-light: an implementation of SVMs. It implements the soft margin and contains the procedures for solving the optimization problems. It is a binary classifier. Examples and descriptions on the web site: http://www.joachims.org/ (http://svmlight.joachims.org/).
Structured Output
Multi-classification
References
A Tutorial on Support Vector Machines for Pattern Recognition, Christopher J. C. Burges. Downloadable article.
The Vapnik-Chervonenkis Dimension and the Learning Capability of Neural Nets. Downloadable presentation.
Computational Learning Theory, Sally A. Goldman, Washington University, St. Louis, Missouri. Downloadable article.
An Introduction to Support Vector Machines (and other kernel-based learning methods), N. Cristianini and J. Shawe-Taylor, Cambridge University Press. Check our library.
The Nature of Statistical Learning Theory, Vladimir Naumovich Vapnik, Springer-Verlag (December 1999). Check our library.
Exercise
1. Write the equations of SVMs for Classification, Ranking and Regression (you can get them from my slides).
2. Write the perceptron algorithm for Classification, Ranking and Regression (the last two you have to derive by looking at what you wrote in point (1)).
3. The same as point (2), using kernels (write the kernel definition as an introduction to this section).