Support-Vector Machines: Introduction

The support vector machine (SVM) is a linear machine with some very nice properties (Haykin chapter 6; see Alpaydin chapter 13 for similar content). Note: part of this lecture drew material from Ricardo Gutierrez-Osuna's Pattern Analysis lectures.

The basic idea of SVM is to construct a separating hyperplane such that the margin of separation between positive and negative examples is maximized.

Principled derivation (structural risk minimization): the error rate is bounded by (1) the training error rate and (2) the VC dimension of the model. SVM makes (1) become zero and minimizes (2).

Optimal Hyperplane

For linearly separable patterns $\{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$ (with $d_i \in \{+1, -1\}$), the separating hyperplane is $\mathbf{w}^T \mathbf{x} + b = 0$:

$\mathbf{w}^T \mathbf{x}_i + b \ge 0$ for $d_i = +1$
$\mathbf{w}^T \mathbf{x}_i + b < 0$ for $d_i = -1$

Distance to the Optimal Hyperplane

[Figure: a point $\mathbf{x}_i$ at angle $\theta$ from $\mathbf{w}$, with distances $d$ and $r$ measured along $\mathbf{w}$.]

Let $\mathbf{w}_o$ be the weight vector of the optimal hyperplane and $b_o$ the optimal bias. For a point $\mathbf{x}_i$ on the hyperplane, $\mathbf{w}_o^T \mathbf{x}_i = -b_o$, so the distance from the origin to the hyperplane is

$d = \|\mathbf{x}_i\| \cos(\mathbf{x}_i, \mathbf{w}_o) = \frac{-b_o}{\|\mathbf{w}_o\|}$

since $\mathbf{w}_o^T \mathbf{x}_i = \|\mathbf{w}_o\| \, \|\mathbf{x}_i\| \cos(\mathbf{w}_o, \mathbf{x}_i) = -b_o$.
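As a quick illustration of these formulas, here is a minimal numpy sketch; the weight vector, bias, and test point are made-up values for illustration, not from the lecture. It computes the distance from the origin to the hyperplane, $d = -b/\|\mathbf{w}\|$, and applies the decision rule:

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 (values assumed for illustration).
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -10.0

# Distance from the origin to the hyperplane: d = -b / ||w||.
d_origin = -b / np.linalg.norm(w)
print(d_origin)            # 2.0

# Decision rule for a test point x: the sign of w^T x + b.
x = np.array([4.0, 1.0])
print(+1 if w @ x + b >= 0 else -1)   # 12 + 4 - 10 = 6 >= 0 -> +1
```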
Distance to the Optimal Hyperplane (cont'd)

[Figure: an arbitrary point $\mathbf{x}$ at signed distance $r$ from the hyperplane, which is itself at distance $d$ from the origin.]

The distance $r$ from an arbitrary point $\mathbf{x}$ to the hyperplane can be calculated as follows.

When the point is in the positive region:

$r = \|\mathbf{x}\| \cos(\mathbf{x}, \mathbf{w}_o) - d = \frac{\mathbf{x}^T \mathbf{w}_o}{\|\mathbf{w}_o\|} + \frac{b_o}{\|\mathbf{w}_o\|} = \frac{\mathbf{w}_o^T \mathbf{x} + b_o}{\|\mathbf{w}_o\|}$

When the point is in the negative region:

$r = d - \|\mathbf{x}\| \cos(\mathbf{x}, \mathbf{w}_o) = -\frac{\mathbf{x}^T \mathbf{w}_o}{\|\mathbf{w}_o\|} - \frac{b_o}{\|\mathbf{w}_o\|} = -\frac{\mathbf{w}_o^T \mathbf{x} + b_o}{\|\mathbf{w}_o\|}$

Optimal Hyperplane and Support Vectors

[Figure: optimal hyperplane with margin of separation $\rho$ and the support vectors marked.]

Support vectors: the input points closest to the separating hyperplane.
Margin of separation $\rho$: the distance between the separating hyperplane and the closest input point.

Optimal Hyperplane and Support Vectors (cont'd)

The optimal hyperplane is supposed to maximize the margin of separation $\rho$. With that requirement, we can write the conditions that $\mathbf{w}_o$ and $b_o$ must meet:

$\mathbf{w}_o^T \mathbf{x}_i + b_o \ge +1$ for $d_i = +1$
$\mathbf{w}_o^T \mathbf{x}_i + b_o \le -1$ for $d_i = -1$

Note the $+1$ and $-1$: the support vectors are those $\mathbf{x}^{(s)}$ for which equality holds (i.e., $\mathbf{w}_o^T \mathbf{x}^{(s)} + b_o = +1$ or $-1$).

Since $r = (\mathbf{w}_o^T \mathbf{x} + b_o)/\|\mathbf{w}_o\|$, for a support vector

$r = 1/\|\mathbf{w}_o\|$ if $d^{(s)} = +1$, and $r = -1/\|\mathbf{w}_o\|$ if $d^{(s)} = -1$.

Optimal Hyperplane and Support Vectors (cont'd)

[Figure: optimal hyperplane with support vectors on both margin boundaries.]

The margin of separation between the two classes is

$\rho = 2r = \frac{2}{\|\mathbf{w}_o\|}$.

Thus, maximizing the margin of separation between the two classes is equivalent to minimizing the Euclidean norm of the weight $\mathbf{w}$!
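The signed-distance and margin formulas above are easy to check numerically. The sketch below uses an assumed $\mathbf{w}$ and $b$, already scaled so that the closest points satisfy $\mathbf{w}^T \mathbf{x} + b = \pm 1$, and computes $r$ and $\rho$:

```python
import numpy as np

# Assumed hyperplane, scaled so support vectors satisfy w^T x + b = +/-1.
w = np.array([2.0, 0.0])
b = -1.0

def signed_distance(x, w, b):
    # r = (w^T x + b) / ||w||: r > 0 on the positive side, r < 0 on the negative.
    return (w @ x + b) / np.linalg.norm(w)

x_pos = np.array([1.0, 0.7])          # a point with w^T x + b = +1
print(signed_distance(x_pos, w, b))   # 1/||w|| = 0.5

rho = 2.0 / np.linalg.norm(w)         # margin of separation rho = 2/||w||
print(rho)                            # 1.0
```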
Primal Problem: Constrained Optimization

For the training set $T = \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$, find $\mathbf{w}$ and $b$ that minimize a certain value (proportional to $1/\rho$) while satisfying a constraint (all examples are correctly classified):

Constraint: $d_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for $i = 1, 2, \dots, N$.
Cost function: $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$.

This problem can be solved using the method of Lagrange multipliers (see the next two slides).

Mathematical Aside: Lagrange Multipliers

Turn a constrained optimization problem into an unconstrained optimization problem by absorbing the constraints into the cost function, weighted by the Lagrange multipliers.

Example: find the point on the circle $x^2 + y^2 = 1$ closest to the point $(2, 2)$ (adapted from Ballard, An Introduction to Natural Computation, 1997, pp. 119-120). Minimize

$F(x, y) = (x - 2)^2 + (y - 2)^2$

subject to the constraint $x^2 + y^2 - 1 = 0$. Absorb the constraint into the cost function, after multiplying by the Lagrange multiplier $\alpha$:

$F(x, y, \alpha) = (x - 2)^2 + (y - 2)^2 + \alpha(x^2 + y^2 - 1)$.

Lagrange Multipliers (cont'd)

We must find the $x$, $y$, $\alpha$ that minimize $F(x, y, \alpha)$. Set the partial derivatives to 0 and solve the system of equations:

$\frac{\partial F}{\partial x} = 2(x - 2) + 2\alpha x = 0$
$\frac{\partial F}{\partial y} = 2(y - 2) + 2\alpha y = 0$
$\frac{\partial F}{\partial \alpha} = x^2 + y^2 - 1 = 0$

Solve for $x$ and $y$ in the first and second equations, then plug those into the third:

$x = y = \frac{2}{1 + \alpha}$, so $\left(\frac{2}{1+\alpha}\right)^2 + \left(\frac{2}{1+\alpha}\right)^2 = 1$,

from which we get $\alpha = 2\sqrt{2} - 1$. Thus $(x, y) = (1/\sqrt{2}, 1/\sqrt{2})$ (verified numerically in the sketch below).

Primal Problem: Constrained Optimization (cont'd)

Putting the constrained optimization problem into Lagrangian form (utilizing the Kuhn-Tucker theorem), we get

$J(\mathbf{w}, b, \alpha) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \alpha_i \left[ d_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]$.

From $\partial J(\mathbf{w}, b, \alpha)/\partial \mathbf{w} = 0$: $\mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \mathbf{x}_i$.
From $\partial J(\mathbf{w}, b, \alpha)/\partial b = 0$: $\sum_{i=1}^{N} \alpha_i d_i = 0$.
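The circle example above can be double-checked numerically. The short sketch below plugs the derived $\alpha = 2\sqrt{2} - 1$ back in and confirms that the gradient of $F(x, y, \alpha)$ vanishes and the constraint holds:

```python
import numpy as np

# Numeric check of the worked example: the closest point on x^2 + y^2 = 1
# to (2, 2). Stationarity gives x = y = 2/(1 + alpha); the constraint then
# gives alpha = 2*sqrt(2) - 1 and (x, y) = (1/sqrt(2), 1/sqrt(2)).
alpha = 2 * np.sqrt(2) - 1
x = y = 2 / (1 + alpha)

print(np.isclose(x, 1 / np.sqrt(2)))   # True: matches the derived solution
print(np.isclose(x**2 + y**2, 1.0))    # True: constraint satisfied

# The partial derivatives of F(x, y, alpha) vanish at the solution:
dFdx = 2 * (x - 2) + 2 * alpha * x
dFdy = 2 * (y - 2) + 2 * alpha * y
print(np.isclose(dFdx, 0.0), np.isclose(dFdy, 0.0))   # True True
```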
Primal Problem: Constrained Optimization (cont'd)

Note that when the optimal solution is reached, the following condition must hold for all $i = 1, 2, \dots, N$ (the Karush-Kuhn-Tucker complementarity condition):

$\alpha_i \left[ d_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right] = 0$

Thus, nonzero $\alpha_i$ can be attained only when $d_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 = 0$, i.e., when $\alpha_i$ is associated with a support vector $\mathbf{x}^{(s)}$! Other conditions include $\alpha_i \ge 0$.

Primal Problem: Constrained Optimization (cont'd)

Plugging $\mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \mathbf{x}_i$ and $\sum_{i=1}^{N} \alpha_i d_i = 0$ back into $J(\mathbf{w}, b, \alpha)$, we get the dual problem:

$J(\mathbf{w}, b, \alpha) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \alpha_i \left[ d_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]$
$= \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \alpha_i d_i \mathbf{w}^T \mathbf{x}_i - b \sum_{i=1}^{N} \alpha_i d_i + \sum_{i=1}^{N} \alpha_i$

Noting that $\mathbf{w}^T\mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \mathbf{w}^T \mathbf{x}_i = \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_i^T \mathbf{x}_j$ and that $\sum_{i=1}^{N} \alpha_i d_i = 0$,

$J = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{N} \alpha_i = Q(\alpha)$.

So $J(\mathbf{w}, b, \alpha) = Q(\alpha)$ (with $\alpha_i \ge 0$). This results in the dual problem (next slide).

Dual Problem

Given the training sample $\{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that maximize the objective function

$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_i^T \mathbf{x}_j$

subject to the constraints

$\sum_{i=1}^{N} \alpha_i d_i = 0$
$\alpha_i \ge 0$ for all $i = 1, 2, \dots, N$.

The problem is stated entirely in terms of the training data $(\mathbf{x}_i, d_i)$, and the dot products $\mathbf{x}_i^T \mathbf{x}_j$ play a key role.

Solution to the Optimization Problem

Once all the optimal Lagrange multipliers $\alpha_{o,i}$ are found (using sequential minimal optimization, etc.), $\mathbf{w}_o$ and $b_o$ can be found as follows:

$\mathbf{w}_o = \sum_{i=1}^{N} \alpha_{o,i} d_i \mathbf{x}_i$

and, from $\mathbf{w}_o^T \mathbf{x}^{(s)} + b_o = d^{(s)}$ when $\mathbf{x}^{(s)}$ is a support vector:

$b_o = d^{(s)} - \mathbf{w}_o^T \mathbf{x}^{(s)}$

Note: evaluating the final estimated function does not need any explicit calculation of $\mathbf{w}_o$, since it can be computed from dot products between input vectors:

$\mathbf{w}_o^T \mathbf{x} = \sum_{i=1}^{N} \alpha_{o,i} d_i \mathbf{x}_i^T \mathbf{x}$
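To see the dual solution in practice, here is a sketch using scikit-learn's SVC, which solves the dual problem internally; the tiny toy data set is made up for illustration. SVC exposes $\alpha_{o,i} d_i$ for the support vectors as dual_coef_, so $\mathbf{w}_o = \sum_i \alpha_{o,i} d_i \mathbf{x}_i$ is a single matrix product:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (assumed for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -3.0]])
d = np.array([1, 1, 1, -1, -1, -1])

# Approximate a hard margin via a large C; SVC solves the dual internally.
svm = SVC(kernel="linear", C=1e6).fit(X, d)

# dual_coef_ holds alpha_i * d_i for the support vectors, so
# w_o = sum_i alpha_i d_i x_i is one matrix product away.
w = svm.dual_coef_ @ svm.support_vectors_
b = svm.intercept_

print(w, b)
print(np.allclose(w, svm.coef_))   # True: matches sklearn's own primal weights
```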
Margin of Separation in SVM and VC Dimension

Statistical learning theory shows that it is desirable to reduce both the error (empirical risk) and the VC dimension of the classifier.

Vapnik (1995, 1998) showed the following. Let $D$ be the diameter of the smallest ball containing all input vectors $\mathbf{x}_i$. The set of optimal hyperplanes defined by $\mathbf{w}_o^T \mathbf{x} + b_o = 0$ has a VC dimension $h$ bounded from above as

$h \le \min\left\{ \left\lceil \frac{D^2}{\rho^2} \right\rceil, m_0 \right\} + 1$

where $\lceil \cdot \rceil$ is the ceiling, $\rho$ is the margin of separation equal to $2/\|\mathbf{w}_o\|$, and $m_0$ is the dimensionality of the input space.

The implication is that the VC dimension can be controlled independently of $m_0$, by choosing an appropriate (large) $\rho$!

Soft-Margin Classification

[Figure: optimal hyperplane with margin $\rho$; support vectors; points inside the margin that are correctly classified; points inside the margin that are incorrectly classified.]

Some problems can violate the condition

$d_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$.

We can introduce a new set of variables $\{\xi_i\}_{i=1}^{N}$:

$d_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$

where $\xi_i$ is called the slack variable.

Soft-Margin Classification (cont'd)

We want to find a separating hyperplane that minimizes the number of misclassifications:

$\Phi(\xi) = \sum_{i=1}^{N} I(\xi_i - 1)$, where $I(\xi) = 0$ if $\xi \le 0$ and $1$ otherwise.

Solving the above is NP-complete, so we instead solve an approximation:

$\Phi(\xi) = \sum_{i=1}^{N} \xi_i$.

Soft-Margin Classification: Solution

Following a similar route involving Lagrange multipliers, with the more restrictive condition $0 \le \alpha_i \le C$, we get the solution:

$\mathbf{w}_o = \sum_{i=1}^{N_s} \alpha_{o,i} d_i \mathbf{x}_i$
$b_o = d_i(1 - \xi_i) - \mathbf{w}_o^T \mathbf{x}_i$

Furthermore, the weight vector can be factored into the cost function, with a control parameter $C$:

$\Phi(\mathbf{w}, \xi) = \underbrace{\frac{1}{2}\mathbf{w}^T\mathbf{w}}_{\text{controls VC dim}} + \underbrace{C \sum_{i=1}^{N} \xi_i}_{\text{controls error}}$
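A minimal sketch of the soft-margin quantities, with toy numbers assumed: for each point the smallest slack restoring the constraint is $\xi_i = \max(0,\, 1 - d_i(\mathbf{w}^T\mathbf{x}_i + b))$, from which the cost $\Phi(\mathbf{w}, \xi)$ follows directly:

```python
import numpy as np

# Toy hyperplane and control parameter (assumed, not from the lecture).
w = np.array([1.0, -1.0])
b = 0.0
C = 10.0

X = np.array([[2.0, -1.0],    # well outside the margin        -> xi = 0
              [0.4, -0.2],    # inside margin, correct side    -> 0 < xi < 1
              [-0.5, 0.5]])   # wrong side of the hyperplane   -> xi > 1
d = np.array([1, 1, 1])

# xi_i = max(0, 1 - d_i (w^T x_i + b)): the smallest slack that restores
# the constraint d_i (w^T x_i + b) >= 1 - xi_i.
xi = np.maximum(0.0, 1.0 - d * (X @ w + b))

# Phi(w, xi) = 0.5 w^T w + C sum(xi): VC-dim term plus error term.
cost = 0.5 * (w @ w) + C * xi.sum()
print(xi, cost)   # [0.  0.4 2. ] 25.0
```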
Nonlinear SVM

[Figure: nonlinear mapping from the input space ($\mathbf{x}$, $\mathbf{x}_i$) to the feature space ($\varphi(\mathbf{x})$, $\varphi(\mathbf{x}_i)$).]

The approach consists of two steps:
1. Nonlinear mapping of an input vector to a high-dimensional feature space (exploiting Cover's theorem).
2. Construction of an optimal hyperplane for separating the features identified in the above step.

Inner-Product Kernel

The input $\mathbf{x}$ is mapped to $\varphi(\mathbf{x})$. With the weight $\mathbf{w}$ (including the bias $b$), the decision surface in the feature space becomes (assuming $\varphi_0(\mathbf{x}) = 1$):

$\mathbf{w}^T \varphi(\mathbf{x}) = 0$

Using the same steps as in the linear SVM, we get

$\mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \varphi(\mathbf{x}_i)$

Combining the above two, we get the decision surface

$\sum_{i=1}^{N} \alpha_i d_i \varphi^T(\mathbf{x}_i) \varphi(\mathbf{x}) = 0$.

Inner-Product Kernel (cont'd)

The inner product $\varphi^T(\mathbf{x})\varphi(\mathbf{x}_i)$ is between two vectors in the feature space. The calculation of this inner product can be simplified by use of an inner-product kernel $K(\mathbf{x}, \mathbf{x}_i)$:

$K(\mathbf{x}, \mathbf{x}_i) = \varphi^T(\mathbf{x})\varphi(\mathbf{x}_i) = \sum_{j=0}^{m_1} \varphi_j(\mathbf{x}) \varphi_j(\mathbf{x}_i)$

where $m_1$ is the dimension of the feature space. (Note: $K(\mathbf{x}, \mathbf{x}_i) = K(\mathbf{x}_i, \mathbf{x})$.)

Inner-Product Kernel (cont'd)

Mercer's theorem states that any $K(\mathbf{x}, \mathbf{x}_i)$ satisfying certain conditions (continuous, symmetric, positive semi-definite) can be expressed as an inner product in a nonlinearly mapped feature space.

The kernel function $K(\mathbf{x}, \mathbf{x}_i)$ allows us to calculate the inner product $\varphi^T(\mathbf{x})\varphi(\mathbf{x}_i)$ in the mapped feature space without any explicit calculation of the mapping function $\varphi(\cdot)$. So the optimal hyperplane becomes:

$\sum_{i=1}^{N} \alpha_i d_i K(\mathbf{x}, \mathbf{x}_i) = 0$
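To make the kernel trick concrete, here is a small sketch of a kernel decision function $f(\mathbf{x}) = \sum_i \alpha_i d_i K(\mathbf{x}_i, \mathbf{x})$ evaluated without ever forming $\varphi(\mathbf{x})$. The support vectors and multipliers below are invented for illustration, not the result of an actual optimization:

```python
import numpy as np

# Sketch: evaluate f(x) = sum_i alpha_i d_i K(x_i, x) without forming phi(x).
def poly_kernel(x, z, p=2):
    # (x^T z + 1)^p implicitly includes a constant feature (phi_0 = 1),
    # which absorbs the bias term.
    return (x @ z + 1.0) ** p

# Hypothetical "support vectors", labels, and multipliers (made up).
sv    = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
d     = np.array([+1, +1, -1])
alpha = np.array([0.3, 0.3, 0.6])

def decision(x):
    return sum(a * di * poly_kernel(xi, x) for a, di, xi in zip(alpha, d, sv))

print(np.sign(decision(np.array([0.8, 0.8]))))   # predicted class: 1.0
```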
Examples of Kernel Functions

Linear: $K(\mathbf{x}, \mathbf{x}_i) = \mathbf{x}^T \mathbf{x}_i$.
Polynomial: $K(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x}^T \mathbf{x}_i + 1)^p$.
RBF: $K(\mathbf{x}, \mathbf{x}_i) = \exp\left( -\frac{1}{2\sigma^2} \|\mathbf{x} - \mathbf{x}_i\|^2 \right)$.
Two-layer perceptron: $K(\mathbf{x}, \mathbf{x}_i) = \tanh\left( \beta_0 \mathbf{x}^T \mathbf{x}_i + \beta_1 \right)$ (for some $\beta_0$ and $\beta_1$).

Expanding Kernel Example

$K(\mathbf{x}, \mathbf{x}_i) = (1 + \mathbf{x}^T \mathbf{x}_i)^2$ with $\mathbf{x} = [x_1, x_2]^T$ and $\mathbf{x}_i = [x_{i1}, x_{i2}]^T$:

$K(\mathbf{x}, \mathbf{x}_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
$= [1, x_1^2, \sqrt{2}\, x_1 x_2, x_2^2, \sqrt{2}\, x_1, \sqrt{2}\, x_2] \, [1, x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2, \sqrt{2}\, x_{i1}, \sqrt{2}\, x_{i2}]^T$
$= \varphi(\mathbf{x})^T \varphi(\mathbf{x}_i)$,

where $\varphi(\mathbf{x}) = [1, x_1^2, \sqrt{2}\, x_1 x_2, x_2^2, \sqrt{2}\, x_1, \sqrt{2}\, x_2]^T$.

Nonlinear SVM: Solution

The solution is basically the same as in the linear case, with $\mathbf{x}_i^T \mathbf{x}_j$ replaced by $K(\mathbf{x}_i, \mathbf{x}_j)$ and the additional constraint $\alpha_i \le C$ added.

Nonlinear SVM Summary

Project the input to a high-dimensional space to turn the problem into a linearly separable one. Issues with a projection to a higher-dimensional feature space:

Statistical problem: danger of invoking the curse of dimensionality and a higher chance of overfitting. Use large margins to reduce the VC dimension.
Computational problem: computational overhead of calculating the mapping $\varphi(\cdot)$. Solve by using the kernel trick (verified numerically in the sketch below).
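The expansion worked out above (and, more broadly, the kernel trick mentioned in the summary) can be checked numerically. The sketch below confirms that the explicit map $\varphi$ reproduces $(1 + \mathbf{x}^T \mathbf{x}_i)^2$ on random inputs:

```python
import numpy as np

# phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2],
# the explicit feature map from the expansion example above.
def phi(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

def K(x, z):
    # The polynomial kernel the expansion started from.
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=2)
z = rng.normal(size=2)
print(np.isclose(phi(x) @ phi(z), K(x, z)))   # True: kernel == explicit map
```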