Lecture 10: A Brief Introduction to the Support Vector Machine
Advanced Applied Multivariate Analysis
STAT 2221, Fall 2013

Sungkyu Jung, Department of Statistics, University of Pittsburgh
Xingye Qiao, Department of Mathematical Sciences, Binghamton University, State University of New York

E-mail: sungkyu@pitt.edu
http://www.stat.pitt.edu/sungkyu/aama/
What's the Problem?

Training data: $\{(y_i, x_i),\ i = 1, \ldots, n\}$, with $y_i \in \mathcal{Y}$ and $x_i \in S \subseteq \mathbb{R}^d$.
Goal: a mapping $\phi : S \to \mathcal{Y}$. Inputs: $x \in S$. Outputs: $y \in \mathcal{Y}$.

Regression setting: $\mathcal{Y} \subseteq \mathbb{R}$. Predict $y_i$ as $\hat{y}_i = \phi(x_i)$, so that $\sum_i (y_i - \hat{y}_i)^2$ is as small as possible.

(Binary) classification setting: $\mathcal{Y} = \{-1, +1\}$. Predict $y_i$ as $\hat{y}_i = \phi(x_i)$, so that $\sum_i 1_{\{y_i \neq \hat{y}_i\}}$ is as small as possible.
0-1 Loss or Other Loss

Goal: a mapping $\phi : S \to \mathcal{Y}$, which predicts $y_i$ as $\hat{y}_i = \phi(x_i)$, so that $y_i \neq \hat{y}_i$ for as few $i$ as possible.
Introduce a function of $x$, $f(x)$, and let $\phi(x) = \mathrm{sign}(f(x))$:
$f(x) > 0 \Rightarrow +1$ (positive class; case), $f(x) < 0 \Rightarrow -1$ (negative class; control).
$$\min_f \sum_i 1_{\{y_i \neq \phi(x_i)\}}, \quad \text{or equivalently} \quad \min_f \sum_i 1_{\{y_i f(x_i) \le 0\}}.$$
The 0-1 loss function is $1_{\{u \le 0\}}$, where $u = y_i f(x_i)$. This may not work: it is not convex!
What function looks similar to the 0-1 loss, but is convex? One answer: the hinge loss, $[1 - u]_+$.
[Figure: the hinge loss, the 0-1 loss, and the DWD loss (C = 1) plotted as functions of $u$ over roughly $u \in [-2, 3]$.]
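A minimal R sketch (not from the slides; the DWD loss is omitted) reproducing the comparison of the 0-1 loss and the hinge loss as functions of $u = y f(x)$:

# compare the 0-1 loss and the hinge loss over u = y * f(x)
u <- seq(-2, 3, by = 0.01)
zero_one <- as.numeric(u <= 0)   # 0-1 loss: 1{u <= 0}
hinge    <- pmax(1 - u, 0)       # hinge loss: [1 - u]_+
plot(u, hinge, type = "l", col = "blue", ylim = c(0, 4),
     xlab = "u = y f(x)", ylab = "loss")
lines(u, zero_one, col = "red")
legend("topright", legend = c("hinge loss", "0-1 loss"),
       col = c("blue", "red"), lty = 1)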
Linear Discrimination

If $f(x) = x^\top\omega + \beta$ is linear, then $f(x) = 0$ is a hyperplane.
Want: a hyperplane in the middle of the two classes.
Separable Case

[Figure, shown over four slides: scatter plot of Class +1 and Class -1 in the $(x_1, x_2)$ plane, with candidate separating hyperplanes.]
Linear Discrimination

If $f(x) = x^\top\omega + \beta$ is linear, then $f(x) = 0$ is a hyperplane.
Want: a hyperplane in the middle of the two classes.
Want: the margin induced by the hyperplane between these two classes to be large.
Illustration

[Figure: plot of SVM, showing the positive and negative classes, the lines $u = 0$ and $u = 1$, and the support vectors.]
Formulation

The margin will be $\frac{2 f(x_{SV})}{\|\omega\|}$. We want to maximize it.
By the definition of support vectors, we also need to make sure that
$$\forall i, \quad y_i f(x_i) \ge y_{SV} f(x_{SV}) = f(x_{SV}).$$
But the norm of $\omega$ can be arbitrary. Rescale the norm of $\omega$ so that $f(x_{SV}) = 1$. So we will try to
$$\max_{\omega, \beta} \ \frac{2}{\|\omega\|}, \quad \text{s.t. } y_i f(x_i) \ge 1.$$
Equivalently,
$$\min_{\omega, \beta} \ \frac{\|\omega\|^2}{2}, \quad \text{s.t. } y_i f(x_i) \ge 1.$$
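A one-line justification of the margin formula (standard, not spelled out on the slide): the distance from a point $x$ to the hyperplane $\{x : x^\top\omega + \beta = 0\}$ is $|f(x)| / \|\omega\|$, so
$$\text{margin} = \frac{2\,|f(x_{SV})|}{\|\omega\|} = \frac{2}{\|\omega\|} \quad \text{after rescaling so that } f(x_{SV}) = 1.$$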
Non-separable Case

$$\min_{\omega, \beta} \ \frac{\|\omega\|^2}{2}, \quad \text{s.t. } y_i f(x_i) \ge 1.$$
If satisfying $y_i f(x_i) \ge 1$ for all $i$ is absolutely impossible, introduce a slack variable $\xi_i \ge 0$ for each data vector $x_i$:
$y_i f(x_i) \ge 1$ is relaxed to $y_i f(x_i) \ge 1 - \xi_i$.
$\xi_i$ represents the violation of correct classification. We want $\sum_i \xi_i$ to be small. Add it to the objective function, rescaled by a constant $C$:
$$\min_{\omega, \beta, \xi} \left( \frac{\|\omega\|^2}{2} + C \sum_i \xi_i \right), \quad \text{s.t. } y_i f(x_i) \ge 1 - \xi_i, \ \xi_i \ge 0.$$
Regularization Framework

Equivalent to
$$\min_{\omega, \beta} \left( \frac{\|\omega\|^2}{2} + C \sum_i [1 - y_i f(x_i)]_+ \right).$$
Or (with $\lambda = 1/(nC)$),
$$\min_{\omega, \beta} \ \frac{1}{n} \sum_i [1 - y_i f(x_i)]_+ + \lambda \frac{\|\omega\|^2}{2}.$$
Or,
$$\min_{\omega, \beta} \ \frac{1}{n} \sum_i [1 - y_i f(x_i)]_+, \quad \text{s.t. } \|\omega\|^2 \le L.$$
Remarks about SVM

Solved by Quadratic Programming (QP) via the dual problem.
The final solution is $\omega = \sum_i \alpha_i y_i x_i$. For support vectors, $\alpha_i > 0$; otherwise $\alpha_i = 0$.
Important observation: the solution is determined/influenced by the support vectors only.
Solution

How do we solve the SVM? Take the first formulation for the non-separable case as an example:
$$\text{Primal:} \quad \min_{\omega, \beta, \xi} \left( \frac{\|\omega\|^2}{2} + C \sum_i \xi_i \right), \quad \text{s.t. } y_i f(x_i) \ge 1 - \xi_i, \ \xi_i \ge 0.$$
We will derive the dual problem, which is a standard Quadratic Programming (QP) problem, and then use a standard QP package to solve it.
Duality of SVM

Lagrangian:
$$L_S = \frac{\omega^\top \omega}{2} + \sum_i \left\{ C \xi_i - \alpha_i [y_i (x_i^\top \omega + \beta) + \xi_i - 1] - \mu_i \xi_i \right\}.$$
Here $\alpha_i, \mu_i \ge 0$ are Lagrange multipliers.
The KKT conditions say $\partial L_S / \partial \omega$, $\partial L_S / \partial \beta$ and $\partial L_S / \partial \xi_i$ all need to equal 0:
$$\omega - \sum_i \alpha_i y_i x_i = 0, \quad \sum_i \alpha_i y_i = 0, \quad C - \alpha_i - \mu_i = 0.$$
Moreover, $\alpha_i [y_i (x_i^\top \omega + \beta) + \xi_i - 1] = 0$ and $\mu_i \xi_i = 0$.
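Substituting the stationarity conditions back into $L_S$ (a standard step not shown on the slide) eliminates $\omega$, $\beta$, and $\xi_i$: using $\omega = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, and $C - \alpha_i - \mu_i = 0$,
$$L_S = \frac{1}{2} \Big\| \sum_i \alpha_i y_i x_i \Big\|^2 - \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j - \beta \sum_i \alpha_i y_i + \sum_i (C - \alpha_i - \mu_i) \xi_i + \sum_i \alpha_i = -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_i \alpha_i,$$
which is exactly the dual objective on the next slide.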
Duality of SVM

The KKT conditions, combined with $\alpha_i \ge 0$, $\mu_i \ge 0$, lead to the dual problem:
$$\text{Dual:} \quad \max_\alpha \ -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_i \alpha_i, \quad \text{s.t. } \sum_i \alpha_i y_i = 0, \ 0 \le \alpha_i \le C.$$
This is a standard QP problem. Solvable. Very fast.
There are $n$ unknown variables instead of $d + 1 + n$ as in the primal problem.
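As a sketch of how this dual could be solved numerically, here is a minimal R example using the quadprog package (the helper name solve_svm_dual and the inputs X, y coded as $\pm 1$, and Cpar are illustrative, not from the slides; solve.QP minimizes, so the dual objective is negated, and a small ridge keeps the quadratic term positive definite):

library(quadprog)

# illustrative helper (not from the slides): solve the SVM dual with solve.QP
# X: n x d data matrix, y: numeric labels in {-1, +1}, Cpar: the constant C
solve_svm_dual <- function(X, y, Cpar) {
  n <- nrow(X)
  # quadratic term Q_ij = y_i y_j x_i' x_j; small ridge keeps it positive definite
  Q <- (y %*% t(y)) * (X %*% t(X)) + diag(1e-8, n)
  dvec <- rep(1, n)                      # linear term: sum_i alpha_i
  # constraints: sum_i alpha_i y_i = 0 (equality), alpha_i >= 0, -alpha_i >= -C
  Amat <- cbind(y, diag(n), -diag(n))
  bvec <- c(0, rep(0, n), rep(-Cpar, n))
  sol <- solve.QP(Dmat = Q, dvec = dvec, Amat = Amat, bvec = bvec, meq = 1)
  sol$solution                           # the alpha_i
}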
From Dual Solution to Primal Solution

$\omega$ can be found from the first KKT condition: $\omega = \sum_i \alpha_i y_i x_i$.
Also note that $\alpha_i [y_i (x_i^\top \omega + \beta) + \xi_i - 1] = 0$, $\mu_i \xi_i = 0$, and $C - \alpha_i - \mu_i = 0$. Thus
$$\alpha_i > 0 \ \Rightarrow \ y_i (x_i^\top \omega + \beta) + \xi_i - 1 = 0, \qquad \alpha_i < C \ \Rightarrow \ \mu_i > 0 \ \Rightarrow \ \xi_i = 0.$$
To find $\beta$, take a data vector $(x_i, y_i)$ with $0 < \alpha_i < C$ and calculate
$$\beta = -x_i^\top \omega - \xi_i / y_i + 1 / y_i = 1 / y_i - x_i^\top \omega \quad (\text{since } \xi_i = 0).$$
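Continuing the sketch from the previous slide (the helper name recover_primal and the numerical tolerance tol are assumptions for illustration), $\omega$ and $\beta$ can be recovered from the $\alpha_i$:

# illustrative helper (not from the slides): recover omega and beta from alpha
recover_primal <- function(alpha, X, y, Cpar, tol = 1e-6) {
  # first KKT condition: omega = sum_i alpha_i y_i x_i
  omega <- colSums(alpha * y * X)
  # pick a support vector with 0 < alpha_i < C (so xi_i = 0): beta = 1/y_i - x_i' omega
  i <- which(alpha > tol & alpha < Cpar - tol)[1]
  beta <- 1 / y[i] - sum(X[i, ] * omega)
  list(omega = omega, beta = beta)
}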
Nonlinear SVM

Why do we learn the dual solution?
To implement the SVM.
To extend the SVM to nonlinear classification via the kernel trick.

In the computation of the linear SVM, only the inner products $\langle x_i, x_j \rangle = x_i^\top x_j$ (the dot products), for $i, j = 1, \ldots, n$, are needed.
For the nonlinear SVM, each input vector $x_i$ is mapped to a higher-dimensional feature space, $x_i \mapsto \phi(x_i) \in \mathbb{R}^Q$, and only the inner products $\langle \phi(x_i), \phi(x_j) \rangle$ are needed in the computation.
The kernel trick replaces $\langle \phi(x_i), \phi(x_j) \rangle$ with a kernel $k(x_i, x_j)$. The mapping $\phi$ is then implicitly defined by $k$.
Examples of Popular Kernels

The linear kernel: $k(x, y) = x^\top y$. This leads to the original, linear SVM.
The polynomial kernel: $k(x, y) = (c + x^\top y)^d$. We can write down the expansion explicitly, so the mapping $\phi$ can be explicitly expressed.
The Gaussian (radial basis function) kernel: $k(x, y) = \exp\left(-\frac{1}{\sigma^2} \|x - y\|^2\right)$. The feature space where $\phi(x)$ lies is infinite dimensional. (The mapping $\phi$ is only implicitly defined.)
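A small illustrative R sketch of these kernels (the parameter names c0, d, and sigma are placeholders, not from the slides); replacing $X X^\top$ by the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ in the dual gives the nonlinear SVM:

# illustrative kernel functions; x and y are numeric vectors
linear_kernel <- function(x, y) sum(x * y)                          # x'y
poly_kernel   <- function(x, y, c0 = 1, d = 2) (c0 + sum(x * y))^d  # (c + x'y)^d
rbf_kernel    <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / sigma^2)

# Gram matrix K_ij = k(x_i, x_j); using K in place of X %*% t(X) in the dual
# gives the nonlinear SVM
gram <- function(X, kernel, ...) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) K[i, j] <- kernel(X[i, ], X[j, ], ...)
  K
}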
Software

library(kernlab)
# train the linear SVM
svp <- ksvm(xtrain, ytrain, kernel = "vanilladot")
# nonlinear SVM with Gaussian kernel
svp <- ksvm(xtrain, ytrain, kernel = "rbfdot")
# nonlinear SVM with polynomial kernel
svp <- ksvm(xtrain, ytrain, kernel = "polydot")
# predict labels on the test set
ypred <- predict(svp, xtest)
table(ytest, ypred)
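For completeness, a hedged end-to-end example (the simulated two-cluster data, type = "C-svc", and C = 1 are illustrative choices, not from the slides):

library(kernlab)
set.seed(1)
# simulated two-cluster training and test data (illustrative only)
n <- 100
xtrain <- rbind(matrix(rnorm(n, mean =  1), ncol = 2),
                matrix(rnorm(n, mean = -1), ncol = 2))
ytrain <- factor(rep(c(1, -1), each = n / 2))
xtest  <- rbind(matrix(rnorm(n, mean =  1), ncol = 2),
                matrix(rnorm(n, mean = -1), ncol = 2))
ytest  <- factor(rep(c(1, -1), each = n / 2))

# Gaussian-kernel SVM with C = 1 (an arbitrary illustrative choice)
svp   <- ksvm(xtrain, ytrain, type = "C-svc", kernel = "rbfdot", C = 1)
ypred <- predict(svp, xtest)
table(ytest, ypred)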