Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University


1. Introduction

Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved significantly on the neural network (NN). In particular, SVM is thought to have a firm theoretical basis while the NN does not.

There are at least two ways to explain the SVM:

- Starting from the optimal separating hyperplane, by Vapnik (1996)
- Empirical risk minimization (ERM) and reproducing kernel Hilbert space (RKHS), by Wahba (2002)

In this chapter, I mostly explain Vapnik's idea and briefly mention its relation to the ERM principle. RKHS will be dealt with in the next chapter, since it relates not only to SVM but also to many other methods such as smoothing splines.

2. Optimal Separating Hyperplane (OSH)

Assume that there is a linear decision boundary (called a "separating hyperplane"), given by $w^\top x + b = 0$, which separates the data completely.

There are infinitely many separating hyperplanes once the data are linearly separable. The optimal separating hyperplane is the separating hyperplane which maximizes the margin, defined by

$$\mathrm{margin}(w, b) = \min_i \{\text{the distance of } x_i \text{ from the boundary } w^\top x + b = 0\}.$$
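The margin of a candidate hyperplane can be computed directly from its definition. The sketch below is only an illustration: the toy data `X`, `y` and the function name `margin` are assumptions, not part of the original slides. It uses the signed distance $y_i (w^\top x_i + b) / \|w\|$, which equals the usual distance whenever the hyperplane separates the data and is negative for a misclassified point.

```python
import numpy as np

def margin(w, b, X, y):
    """Smallest signed distance of the points in X to the hyperplane w.x + b = 0.

    Positive exactly when (w, b) separates the two classes.
    """
    w = np.asarray(w, dtype=float)
    signed_distance = y * (X @ w + b) / np.linalg.norm(w)
    return signed_distance.min()

# Hypothetical toy data: three points with labels +1 / -1.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])

m_vertical = margin([1.0, 0.0], 0.0, X, y)   # hyperplane x1 = 0
m_diagonal = margin([1.0, 1.0], 0.0, X, y)   # hyperplane x1 + x2 = 0
print(m_vertical, m_diagonal)
```

Both hyperplanes separate this toy data, but the first has the larger margin, so it is the better candidate for the OSH.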

[Figure: the separating hyperplane $w^\top x + b = 0$ in the $(x_1, x_2)$ plane, with the margin marked on both sides and the two classes labeled +1 and -1.]

Four questions

- Is the optimal separating hyperplane good?
- How do we find the optimal separating hyperplane?
- How do we work with linearly non-separable data?
- How do we make a nonlinear decision boundary?

3. Theoretical motivation

Consider the set of classifiers $\{g(x \mid w, b) = \mathrm{sign}(w^\top x + b)\}$. Let $R_n(w, b)$ and $R(w, b)$ be the training error and test error of the classifier $g(x \mid w, b)$, where

$$R_n(w, b) = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq g(x_i \mid w, b))$$

and

$$R(w, b) = E\big(I(Y \neq g(X \mid w, b))\big).$$

Then, we have (Vapnik 1998)

$$R(w, 0) \le \frac{1}{n+1} E\left(\frac{D_{n+1}}{\rho_{n+1}}\right),$$

where $D_{n+1} = \max_{i=1,\ldots,n+1} \|x_i\|$ and $\rho_{n+1}$ is the margin of the decision boundary $w^\top x = 0$ for $(n + 1)$ examples randomly drawn from the population.

This upper bound on the test error implies that, by maximizing the margin, we can decrease the upper bound on the test error.

4. Computation of OSH

It can be shown that, to find the OSH, we need to solve the following constrained optimization problem:

$$\text{Minimize } \tau(w) = \|w\|^2 / 2 \quad \text{subject to } y_i (w^\top x_i + b) \ge 1 \text{ for } i = 1, \ldots, n.$$

This is a QP (quadratic programming) problem, and its dual problem is

$$\text{Maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{subject to } \alpha_i \ge 0 \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0.$$

The final classifier is

$$\mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i y_i x_i^\top x + b\Big).$$
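The step from the primal problem to the dual is not spelled out on the slide. A standard Lagrangian argument, sketched here under the usual convexity assumptions, fills it in; the symbol $L$ for the Lagrangian is introduced only for this illustration.

```latex
% Lagrangian of the primal problem, with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \frac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]

% Stationarity in w and b
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

% Substituting back gives the dual objective
W(\alpha) = \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j
```

The expression $w = \sum_i \alpha_i y_i x_i$ is also what turns $\mathrm{sign}(w^\top x + b)$ into the final classifier written in terms of inner products $x_i^\top x$.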

Remarks

- The optimal separating hyperplane depends on $x_1, \ldots, x_n$ only through the inner products between $x$ and the $x_i$'s. This is an important feature, which is fully utilized for constructing a nonlinear decision boundary.
- In the dual problem, the number of variables (the dimension of $\alpha$) is always the same as the sample size, no matter how large the dimension of $x$ is. This is a practically important feature when the dimension of $x$ is large compared to the sample size, which is frequently observed in microarray data.

By the Karush-Kuhn-Tucker conditions, we have

$$\alpha_i \big(y_i (w^\top x_i + b) - 1\big) = 0,$$

which implies that $\alpha_i > 0$ only when the distance of $x_i$ from the OSH is minimal (such an $x_i$ is called a "support vector"), and $\alpha_i = 0$ otherwise. That is, the OSH depends only on the locations of the support vectors. This property makes the OSH very robust to outliers.
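A small numerical check can make the role of the support vectors concrete. The sketch below is illustrative only: the toy data and the dual solution `alpha` are assumptions chosen so that the optimum can be verified by hand, and $b$ is recovered from a support vector using $y_k (w^\top x_k + b) = 1$.

```python
import numpy as np

# Hypothetical toy data; the third point lies far from the boundary.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])

# Assumed dual solution for this data (it satisfies alpha >= 0 and sum(alpha * y) = 0).
alpha = np.array([0.5, 0.5, 0.0])

# w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# Support vectors are the points with alpha_i > 0.
support = np.flatnonzero(alpha > 1e-12)

# For a support vector x_k: y_k (w.x_k + b) = 1, hence b = y_k - w.x_k (since y_k is +1 or -1).
k = support[0]
b = y[k] - w @ X[k]

# Complementary slackness: alpha_i * (y_i (w.x_i + b) - 1) should vanish for every i.
constraint_gap = y * (X @ w + b) - 1.0
kkt_residual = alpha * constraint_gap
print(w, b, support, kkt_residual)
```

Moving the third point anywhere on its side of the margin leaves `w` and `b` unchanged, because its multiplier is zero.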

5. For linearly non-separable data: soft margin

Idea: use slack variables.

The OSH with slack variables can be found by solving the following optimization problem:

$$\text{Minimize } \tau(w, \xi) = \|w\|^2 / 2 + C \sum_{i=1}^{n} \xi_i$$

subject to

$$y_i (w^\top x_i + b) \ge 1 - \xi_i \text{ for } i = 1, \ldots, n,$$
$$\xi_i \ge 0 \text{ for } i = 1, \ldots, n.$$

[Figure: Soft margin]

Remark

The constant $C$ controls how much slack (violation of the margin) we allow. If $C = \infty$, the final OSH is the same as that without slack variables (provided the data are linearly separable), and if $C = 0$, $w = 0$ is the OSH. In other words, $C$ controls the complexity of the final OSH.

The dual optimization problem is

$$\text{Maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{subject to } 0 \le \alpha_i \le C \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0.$$

6. Toward a nonlinear decision boundary

Feature space

The main idea is to transform $x$ by a nonlinear map $\phi : R^p \to F$ and construct the linear decision boundary based on $\phi(x)$. Here, $F$ is called the feature space.

To construct the OSH based on a given feature map $\phi$, we need to solve

$$\text{Maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{subject to } \alpha_i \ge 0 \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0,$$

where $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$.

Hence, to get the OSH with feature $\phi(x)$, the only thing we need is the inner product function (called the "kernel") $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$. In practice, we choose the kernel $k$ instead of the feature map $\phi$.

Note that the computation of the final decision boundary does not depend on the dimension of the feature space, which can be infinite. This is an important advantage when the dimension of the feature space is large, which is frequently observed, in particular, in microarray data.
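To see that a kernel really is an inner product in some feature space, one can compare the two sides numerically for a case where $\phi$ is known explicitly. The sketch below assumes two-dimensional inputs and the degree-2 polynomial kernel $(1 + x^\top z)^2$; the function names `poly_kernel` and `phi` and the test points are chosen only for illustration.

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel (1 + x.z)^d."""
    return (1.0 + x @ z) ** d

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

# Hypothetical pair of input points.
x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])

k_direct = poly_kernel(x, z)     # evaluated in the 2-D input space
k_feature = phi(x) @ phi(z)      # evaluated in the 6-D feature space
print(k_direct, k_feature)
```

The kernel evaluation never forms the six-dimensional vectors, which is why working with $k$ instead of $\phi$ remains feasible even when the feature space is very large.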

Choice of kernel

A necessary and sufficient condition for a given kernel $k$ to be an inner product of a certain feature map is that the matrix $[k(x_i, x_j)]_{i,j}$ is positive semi-definite for any choice of points $x_1, \ldots, x_n$. A kernel function satisfying this condition is called a Mercer kernel. Examples of Mercer kernels are

- Polynomial: $k(x_1, x_2) = (1 + x_1^\top x_2)^d$
- Radial basis: $k(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / c)$
- Neural network: $k(x_1, x_2) = \tanh(\kappa_1 x_1^\top x_2 + \kappa_2)$ (this satisfies the condition only for some values of $\kappa_1$ and $\kappa_2$)
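The condition on the kernel matrix can be checked numerically on a sample of points. The sketch below is only an illustration: the random data, the parameter values `d=2` and `c=1.0`, and the helper names are assumptions. A non-negative smallest eigenvalue on one sample does not prove that a kernel is a Mercer kernel, but a clearly negative one would show that it is not.

```python
import numpy as np

def poly_kernel(A, B, d=2):
    """Gram matrix of the polynomial kernel (1 + a.b)^d between rows of A and B."""
    return (1.0 + A @ B.T) ** d

def rbf_kernel(A, B, c=1.0):
    """Gram matrix of the radial basis kernel exp(-||a - b||^2 / c)."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-np.maximum(sq_dist, 0.0) / c)

def min_eigenvalue(K):
    """Smallest eigenvalue of the symmetrized matrix K."""
    return np.linalg.eigvalsh((K + K.T) / 2.0).min()

# Hypothetical sample: 20 random points in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

K_poly = poly_kernel(X, X)
K_rbf = rbf_kernel(X, X)
print(min_eigenvalue(K_poly), min_eigenvalue(K_rbf))
```

Up to rounding error, both smallest eigenvalues should be non-negative.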

7. Support Vector Machine (SVM)

SVM = feature space (kernel) + soft margin

To use SVM, we need to choose a kernel $k(x_1, x_2)$ and the regularization parameter $C$ for the soft margin, for a given data set. There is no systematic way of doing so. From the statistical point of view, this is a model selection problem. Much research on this matter is ongoing.

Pros and cons of SVM

Pros

- Theoretically well established (compared to other methods).
- The dimension of the input does not matter as long as the kernel is well selected.
- The final model is robust to outliers.

Cons

- Selection of the kernel is not easy.
- Computation is very hard, in particular when the sample size is large.
- Interpretation is almost impossible.

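8. Understanding SVM through the ERM principle

Recall that SVM solves the following optimization problem:

$$\text{Minimize } \tau(w, \xi) = \|w\|^2 / 2 + C \sum_{i=1}^{n} \xi_i$$

subject to

$$y_i (w^\top x_i + b) \ge 1 - \xi_i \text{ for } i = 1, \ldots, n,$$
$$\xi_i \ge 0 \text{ for } i = 1, \ldots, n.$$

For fixed $w$ and $b$, the smallest feasible slack is $\xi_i = [1 - y_i (w^\top x_i + b)]_+$, so this is equivalent to solving the following problem:

$$\text{Minimize } \|w\|^2 / 2 + C \sum_{i=1}^{n} [1 - y_i (w^\top x_i + b)]_+ \qquad (1)$$

with respect to $w$ and $b$, where $[z]_+ = z I(z > 0)$.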

If we let $l(y, f) = [1 - yf]_+$ (called the "hinge loss"), then, after dividing by $C$, (1) becomes

$$\sum_{i=1}^{n} l(y_i, w^\top x_i + b) + \lambda \|w\|^2 / 2$$

with $\lambda = 1/C > 0$.

Hence, the SVM estimator is a shrinkage estimator with the hinge loss and ridge penalty.
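The penalized hinge-loss form can be minimized directly, without going through the dual QP. The sketch below uses plain subgradient descent on a linear SVM. It is an assumption-laden illustration, not the method of the slides: the toy data, the step size `lr`, the number of steps and the function names are all hypothetical, and a fixed step size only brings the iterate near the minimizer rather than exactly to it.

```python
import numpy as np

def objective(w, b, X, y, lam):
    """Sum of hinge losses plus the ridge penalty lam * ||w||^2 / 2."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return hinge.sum() + lam * (w @ w) / 2.0

def fit_subgradient(X, y, lam=0.1, lr=0.01, n_steps=500):
    """Subgradient descent on the hinge loss with ridge penalty (linear kernel)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        # Points with y_i (w.x_i + b) < 1 contribute a nonzero subgradient.
        active = y * (X @ w + b) < 1.0
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [1.0, 3.0], [3.0, 1.0],
              [-2.0, -2.0], [-1.0, -3.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w_hat, b_hat = fit_subgradient(X, y)
pred = np.sign(X @ w_hat + b_hat)
print(w_hat, b_hat, objective(w_hat, b_hat, X, y, 0.1))
```

A larger `lam` (that is, a smaller $C$) shrinks `w_hat` toward zero, which is the shrinkage behaviour described above.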

9. References

Vapnik, V. (1996). The Nature of Statistical Learning Theory. Springer-Verlag, New York.

Wahba, G. (2002). Soft and hard classification by reproducing kernel Hilbert space methods. Proceedings of the National Academy of Sciences, 99.
