Soft-margin SVM can address linearly separable problems with outliers

Size: px

Start display at page:

Download "Soft-margin SVM can address linearly separable problems with outliers"

Peter Reynolds
5 years ago
Views:

1 Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly separable probles need a higher expressive power (i.e. ore coplex feature cobinations) We do not want to loose the advantages of linear separators (i.e. large argin, theoretical guarantees) Solution Map input exaples in a higher diensional feature space Perfor linear classification in this higher diensional space Non-linear Support Vector Machines feature ap Φ : X H Φ is a function apping each exaple to a higher diensional space H Exaples x are replaced with their feature apping Φ(x) The feature apping should increase the expressive power of the representation (e.g. introducing features which are cobinations of input features) Exaples should be (approxiately) linearly separable in the apped space Feature ap Hoogeneous (d = 2) Inhoogeneous (d = 2) Polynoial apping ( ) x1 Φ = x 2 ( ) x1 Φ = x 2 x 2 1 x 1 x 2 x 2 2 x 1 x 2 x 2 1 x 1 x 2 x 2 2 Maps features to all possible conjunctions (i.e. products) of features: 1. of a certain degree d (hoogeneous apping) 2. up to a certain degree (inhoogeneous apping) 1

2 Feature ap Φ Non-linear Support Vector Machines f(x) Linear separation in feature space SVM algorith is applied just replacing x with Φ(x): f(x) = w T Φ(x) + w 0 A linear separation (i.e. hyperplane) in feature space corresponds to a non-linear separation in input space, e.g.: ( ) x1 f = sgn(w x 1 x w 2 x 1 x 2 + w 3 x w 0 ) 2 Rationale Retain cobination of regularization and data fitting Regularization eans soothness (i.e. saller weights, lower coplexity) of the learned function Use a sparsifying loss to have few support vector 2

3 ɛ-insensitive loss l(f(x), y) = y f(x) ɛ = { 0 if y f(x) ɛ y f(x) ɛ otherwise Tolerate sall (ɛ) deviations fro the true value (i.e. no penalty) Defines an ɛ-tube of insensitiveness around true values This also allows to trade off function coplexity with data fitting (playing on ɛ value) Optiization proble Note 1 in w X,w 0 IR, ξ,ξ IR 2 w 2 + C subject to (ξ i + ξi ) w T Φ(x i ) + w 0 y i ɛ + ξ i y i (w T Φ(x i ) + w 0 ) ɛ + ξ i ξ i, ξ i 0 Two constraints for each exaple for the upper and lower sides of the tube Slack variables ξ i, ξ i penalize predictions out of the ɛ-insensitive tube Lagrangian We include constraints in the iniization function using Lagrange ultipliers (α i, α i, β i, βi 0): L = 1 2 w 2 + C (ξ i + ξi ) (β i ξ i + βi ξi ) α i (ɛ + ξ i + y i w T Φ(x i ) w 0 ) αi (ɛ + ξi y i + w T Φ(x i ) + w 0 ) 3

4 Vanishing the derivatives wrt the prial variables we obtain: w = w (αi α i )Φ(x i ) = 0 w = (αi α i )Φ(x i ) w 0 = (α i αi ) = 0 ξ i = C α i β i = 0 α i [0, C] = C αi βi = 0 αi [0, C] ξ i Substituting in the Lagrangian we get: (αi α i )(αj α j )Φ(x i ) T Φ(x j ) } {{ } w 2 ξ i (C β i α i ) + =0 ξ i (C β i α i ) =0 ɛ (α i + αi ) + y i (αi α i ) + w 0 (αi α i )(αj α j )Φ(x i ) T Φ(x j ) (α i α i ) } {{ } =0 ax α IR 1 2 subject to (αi α i )(αj α j )Φ(x i ) T Φ(x j ) ɛ (αi + α i ) + y i (αi α i ) (α i αi ) = 0 α i, α i [0, C] i [1, ] 4

5 Regression function f(x) = w T Φ(x) + w 0 = (α i αi )Φ(x i ) T Φ(x) + w 0 Karush-Khun-Tucker conditions (KKT) At the saddle point it holds that for all i: α i (ɛ + ξ i + y i w T Φ(x i ) w 0 ) = 0 αi (ɛ + ξi y i + w T Φ(x i ) + w 0 ) = 0 β i ξ i = 0 βi ξi = 0 Cobined with C α i β i = 0,α i 0,β i 0 and C α i β i = 0,α i 0,β i 0 we get α i [0, C] α i [0, C] and α i = C if ξ i > 0 α i = C if ξ i > 0 Support Vectors All patterns within the ɛ-tube, for which f(x i ) y i < ɛ, have α i, αi estiated function f. = 0 and thus don t contribute to the Patterns for which either 0 < α i < C or 0 < α i < C are on the border of the ɛ-tube, that is f(x i) y i = ɛ. They are the unbound support vectors. The reaining training patterns are argin errors (either ξ i > 0 or ξi > 0), and reside out of the ɛ-insensitive region. They are bound support vectors, with corresponding α i = C or αi = C. Support Vectors 5

6 : exaple for decreasing ɛ References Non linear SVM C. Burges, A tutorial on support vector achines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), , Support vector regression J.Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cabridge University Press, 2004 (Section 7.3.3) Appendix Sallest enclosing hypersphere Support vector ranking Sallest Enclosing Hypersphere Rationale Characterize a set of exaples defining boundaries enclosing the 6

7 Find sallest hypersphere in feature space enclosing data points Account for outliers paying a cost for leaving exaples out of the sphere Usage One-class classification: odel a class when no negative exaples exist Anoaly/novelty detection: detect test data falling outside of the sphere and return the as novel/anoalous (e.g. intrusion detection systes, Alzheier s patients onitoring) Sallest Enclosing Hypersphere Optiization proble in R IR,o H,ξ IR R 2 + C ξ i subject to Φ(x i ) o 2 R 2 + ξ i i = 1,..., ξ i 0, i = 1,..., Note o is the center of the sphere R is the radius which is iniized slack variables ξ i gather costs for outliers Sallest Enclosing Hypersphere Lagrangian (α i, β i 0) L = R 2 + C ξ i α i (R 2 + ξ i Φ(x i ) o 2 ) β i ξ i Vanishing the derivatives wrt prial variables R = 2R(1 α i ) = 0 α i = 1 o = 2 α i (Φ(x i ) o)( 1) = 0 o ξ i = C α i β i = 0 α i [0, C] α i =1 = α i Φ(x i ) 7

8 Sallest Enclosing Hypersphere R 2 (1 α i ) + ξ i (C α i β i ) =0 =0 + α i (Φ(x i ) α j Φ(x j )) T (Φ(x i ) α h Φ(x h )) j=1 o h=1 o Sallest Enclosing Hypersphere α i (Φ(x i ) α j Φ(x j )) T (Φ(x i ) α h Φ(x h )) = j=1 = α i Φ(x i ) T Φ(x i ) α i Φ(x i ) T α i j=1 α j Φ(x j ) T Φ(x i ) + h=1 h=1 α i j=1 =1 = α i Φ(x i ) T Φ(x i ) α i Φ(x i ) T α h Φ(x h ) α j Φ(x j ) T j=1 α j Φ(x j ) h=1 α h Φ(x h ) = Sallest Enclosing Hypersphere ax α IR subject to α i Φ(x i ) T Φ(x i ) α i α j Φ(x i ) T Φ(x j ) α i = 1, 0 α i C, i = 1,...,. Distance function The distance of a point fro the origin is: R 2 (x) = Φ(x) o 2 = (Φ(x) α i Φ(x i )) T (Φ(x) α j Φ(x j )) = Φ(x) T Φ(x) 2 j=1 α i Φ(x i ) T Φ(x) + α i α j Φ(x i ) T Φ(x j ) 8

9 Sallest Enclosing Hypersphere Karush-Khun-Tucker conditions (KKT) At the saddle point it holds that for all i: β i ξ i = 0 α i (R 2 + ξ i Φ(x i ) o 2 ) = 0 Support vectors Unbound support vectors (0 < α i < C), whose iages lie on the surface of the enclosing sphere. Bound support vectors (α i = C), whose iages lie outside of the enclosing sphere, which correspond to outliers. All other points (α = 0) with iages inside the enclosing sphere. Sallest Enclosing Hypersphere Sallest Enclosing Hypersphere Decision function The radius R of the enclosing sphere can be coputed using the distance function on any unbound support vector x: R 2 (x) = Φ(x) T Φ(x) 2 α i Φ(x i ) T Φ(x) + α i α j Φ(x i ) T Φ(x j ) 9

10 A decision function for novelty detection could be: f(x) = sgn ( R 2 (x) (R ) 2) i.e. positive if the exaples lays outside of the sphere and negative otherwise Support Vector Ranking Rationale Order exaples by relevance (e.g. eail urgence, ovie rating) Learn scoring function predicting quality of exaple Constraint function to score x i higher than x j if it is ore relevant (pairwise coparisons for training) Easily foralized as a support vector classification task Support Vector Ranking Optiization proble in w X,w 0 IR, ξ i,j IR 1 2 w 2 + C i,j ξ i,j subject to w T Φ(x i ) w T Φ(x j ) 1 ξ i,j ξ i,j 0 i, j : x i x j Note There is one constraint for each pair of exaples having ordering inforation (x i x j eans the forer is coes first in the ranking) Exaples should be correctly ordered with a large argin Support Vector Ranking Support vector classification on pairs in w X,w 0 IR, ξ i,j IR subject to 1 2 w 2 + C i,j ξ i,j y i,j w T (Φ(x i ) Φ(x j )) 1 ξ i,j Φ(x ij) ξ i,j 0 i, j : x i x j where labels are always positive y i,j = 1 10

11 Support Vector Ranking Decision function f(x) = w T Φ(x) Standard support vector classification function (unbiased) Represents score of exaple for ranking it 11

Non-linear Support Vector Machines

Non-linear Support Vector Machines Andrea Passerini passerini@disi.unitn.it Machine Learning Non-linear Support Vector Machines Non-linearly separable problems Hard-margin SVM can address linearly separable