7.1 Support Vector Machine

Size: px

Start display at page:

Download "7.1 Support Vector Machine"

Marshall Shepherd
5 years ago
Views:

1 67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to the primal problem eqn. hyperplane with margin errors: 6.3 representing the maximal margin separating l min w,b,ɛ i w w + ν subect to i ɛ i y i w x i b ɛ i ɛ i 0 i,..., m We will now erive the Lagrangian Dual of this problem. By oing so a new ey property will emerge facilitate by the fact that the criteria function θµ note there are no equality constraints thus there is no nee for λ involves only inner-proucts of the training instance vectors x i. This property will form the ey of mapping the original input space of imension n to a higher imensional space thereby allowing for non-linear ecision surfaces for separating the training ata. Note that with this particular problem the strong uality conitions are satisfie because the criteria function an the inequality constraints form a convex set. The Lagrangian taes the following form: Recall that Lw, b, ɛ i, µ m w w + ν ɛ i i µ i [y i w x i b + ɛ i ] i θµ min Lw, b, ɛ, µ, δ. w,b,ɛ δ i ɛ i Since the minimum is obtaine at the vanishing partial erivatives of the Lagrangian with respect to w, b, the next step woul be to evaluate those constraints an substitute them bac into L to obtain θµ: w w i b i µ i y i x i 0 7. µ i y i 0 i 7. ɛ i ν µ i δ i

2 Lecture 7: Support Vector Machines an Kernel Functions II 7- From the first constraint 7. we obtain w i µ iy i x i, that is, w is escribe by a linear combination of a subset of the training instances. The reason that not all instances participate in the linear superposition is ue to the KKT conitions: µ i 0 when y i w x i b >, i.e., the instance x i is classifie correctly an is not a bounary point, an conversely, µ i > 0 when y i w x i b ɛ i, i.e., when x i is a bounary point or when x i is a margin error ɛ i > 0 note that for a margin error instance the value of ɛ i woul be the smallest possible require to reach an equality in the constraint because the criteria function penalizes large values of ɛ i. The bounary points an the margin errors are calle support vectors thus w is efine by the support vectors only. The thir constraint 7.3 is equivalent to the constraint: 0 µ i ν i,..., l, since δ i 0. Substituting these results/constraints bac into the Lagrangian L we obtain the ual problem: max θµ µ,...,µ m µ i µ i µ y i y x i x 7.4 i subect to i, 0 µ i ν i,..., m y i µ i 0 i The criterion function θµ can be written in a more compact manner as follows: Let M be a l l matrix whose entries are M i y i y x i x then θµ µ µ Mµ where is the vector of,..., an µ is the vector µ,..., µ m an µ is the transpose row vector. Note that M is positive efinite, i.e., x Mx > 0 for all vectors x 0 a property which will be important later. The ey feature of the ual problem is not so much that it is simpler than the primal in fact it isn t since the primal has no equality constraints or that it has a more elegant feel, the ey feature is that the problem is completely escribe by the inner proucts of the training instances x i, i,..., m. This fact will be shown to be a crucial ingreient in the so calle ernel tric for the computation of inner-proucts in high imensional spaces using simple functions efine on pairs of training instances. 7. The Kernel Tric We ene with the ual formulation of the SVM problem an notice that the input ata vectors x i are represente by the Gram matrix M. In other wors, only inner-proucts of the input vectors play a role in the ual formulation there is no explicit use of x i or any other function of x i besies inner-proucts. This observation suggests the use of what is nown as the ernel tric to replace the inner-proucts by non-linear functions. The common principle of ernel methos is to construct nonlinear variants of linear algorithms by substituting inner-proucts by nonlinear ernel functions. Uner certain conitions this process can be interprete as mapping of the original measurement vectors so calle input space onto some higher imensional space possibly infinitely high commonly referre to as the feature space. Mathematically, the ernel approach is efine as follows: let x,..., x l be vectors in the input space, say R n, an consier a mapping φx : R n F where F is an inner-prouct space.

3 Lecture 7: Support Vector Machines an Kernel Functions II 7-3 The ernel-tric is to calculate the inner-prouct in F using a ernel function : R n R n R, x i, x φx i φx, while avoiing explicit mappings evaluation of φ. Common choices of ernel selection inclue the th orer polynomial ernels x i, x x i x + θ an the Gaussian RBF ernels x i, x exp x σ i x. If an algorithm can be restate such that the input vectors appear in terms of inner-proucts only, one can substitute the inner-proucts by such a ernel function. The resulting ernel algorithm can be interprete as running the original algorithm on the space F of mappe obects φx. We now that M of the ual form is positive semi-efinite because M can be written is M Q Q where Q [y x,..., y l x l ]. Therefore x Mx Qx 0 for all choices of x which means that the eigenvalues of M are non-negative. If the entries of M are to be replace with y i y x i, x then the conition we must enforce on the function is that it is a positive efinite ernel function. A positive efinite function is efine such that for any set of vectors x,..., x q an for any values of q the matrix K whose entries are K i x i, x is positive semi-efinite. Formally, the conitions for amissible ernels are nown as Mercer s conitions summarize below: Theorem Mercer s Conitions Let x, y be symmetric an continuous. conitions are equivalent: The following. x, y i α iφ i xφ i y φx φy for any uniformly converging series α i > 0.. for all ψ satisfying x ψ xx <, then x, yψxψyxy 0 x y 3. for all {x i } q i an for all q, the matrix K i x i, x is positive semi-efinite. Perhaps the non-obvious conition is No. which allows for the feature map φ to have infinitely many coorinates a vector in Hilbert space. For example, as we shall see below, the ernel exp x σ i x is an inner-prouct of two vectors with infinitely many coorinates. We will consier next a number of popular ernels. 7.. The Homogeneous Polynomial Kernel Let x, y R an efine x, y x y where > 0 is a natural number. Then, the corresponing feature map φx has + O coorinates which tae the value: φx x n n,..., n xn n i 0, P i n i where n,...,n!/n! n! is the multinomial coefficient number of ways to istribute balls into bins where the th bin hol exactly n 0 balls: x x n,..., n n i 0, P i n i x n xn. The imension of the vector space φx where x R can be measure using the following combinatorial problem: how many arrangements of partitions to be place among items? the answer is + + O. For example, gives us : where φx x, x, x x. x y x y + x x y y + x y φx φy,

4 Lecture 7: Support Vector Machines an Kernel Functions II The non-homogeneous Polynomial Kernel The feature map φx contains all monomials whose power is lesser or equal to, i.e., i n i. This can be acheive by increasing the imension to + where n + is use to fill the gap between i n i < an. Therefore the imension of φx where x R woul be +. We have: x y + θ x y x y + θ θ n,..., n + n i 0, P + i n i Therefore, the entries of the vector φx tae the values: φx x n n,..., n xn θn +/ + For example, gives us : x n yn xn yn θ n +/ θ n +/ n i 0, P + i n i x y + θ x y + x x y y + x y + θx y + θx y + θ φx φy, where φx x, x, x x, θx, θx, θ. In this example, φ is a mapping from R to R 6 an hyperplanes φw φx b 0 in R 6 correspon to conics in R : w x + w x + w w x x + θw x + θw x + θ b 0 Assume we woul lie to fin a separating conic Parabola, Hyperbola, Ellipse function rather than a line in R. The iscussion so far suggests we construct the Gram matrix M in the ual form with the polynomial ernel x, y x y + θ for some parameter θ of our choosing. The extra effort we will nee to invest is negligible simply replace every occurrence x i x with x i x + θ The RBF Kernel The function x, y e x y /σ nown as a Raial Basis Function RBF is a ernel function but with an infinite expansion. Without loss of generality let σ, then we have: e x y / e x / e y / e x y x y e x / e y /! 0 x e /! 0 0 P i n i e y / x y! e x /! n,..., n / x n xn e y /! From which we can see that the entries of the feature map φx are: x e φx / x n /! n,..., n xn n,..., n 0,..,, P i n i / y n yn

5 Lecture 7: Support Vector Machines an Kernel Functions II Classifying New Instances By aopting some ernel we are in fact mapping x φx, thus we then procee to solve for φw an b using some QLP solver. The QLP solution of the ual form will yiel the solution for the Lagrange multipliers µ,..., µ m. We saw from eqn. 7. that we can express φw as a function of the mappe examples: φw µ i y i φx i. i Rather than explicitly representing φw a tas which may be prohibitly expensive since in general the imension of the feature space of a polynomial mapping is + we store all the support vectors those input vectors with corresponing µ i > 0 an use them for the evaluation of test examples: fx signφw φx b sign i µ i y i φx i φx b sign i µ i y i x i, x b. We see that the ernel tric enable us to loo for a non-linear separating surface by maing an implicit mapping of the input space onto a higher imensional feature space using the same ual form of the SVM formulation the only change require was in the way the Gram matrix was constructe. The price pai for this convenience is to carry all the support vectors at the time of classification fx. A couple of notes may be worthwhile at this point. The constant b can be recovere from any of the support vectors. Say, x + is a positive support vector but not a margin error, i.e., ɛ 0. Then φw φx + b from which b can be recovere. The secon note is that the number of support vectors is typically aroun 0% of the number of training examples empirically. Thus the computational loa uring evaluation of fx may be relatively high. Approximations have been propose in the literature by looing for a reuce number of support vectors not necessarily aligne with the training set but this is beyon the scope of this course. The ernel tric gaine its popularity with the introuction of the SVM but since then has taen a life of its own an has been applie to principal component analysis PCA, rige regression, canonical correlation analysis CCA, QR factorization an the list goes on. We will meet again with the ernel tric later on.

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin