Soft-Margin Support Vector Machine

Chih-Hao Chang
Institute of Statistics, National University of Kaohsiung @ ISS, Academia Sinica
1. Review of Hard-Margin SVM
2. Hard-Margin Versus Soft-Margin
3. Soft-Margin SVM
4. Discussion
Review of Hard-Margin SVM

Data set: $\{(y_n, x_n)\}_{n=1}^N$, with $y_n \in \{+1, -1\}$ and $x_n \in \mathbb{R}^d$. Example with $d = 2$:

[Figure: scatter plot of a toy data set, axes $x_1$ and $x_2$]
Original problem of a linear classifier $h: w^\top x + b = 0$:
$$\max_{b,w}\ \text{margin}(h) \quad \text{subject to } h \text{ classifies every } (x_n, y_n) \text{ correctly},$$
where $\text{margin}(h) = \min_{n=1,\dots,N} \text{distance}(x_n, h)$.

Parametrization of the problem:
$$\max_{b,w}\ \text{margin}(h) \quad \text{subject to } y_n(w^\top x_n + b) > 0,\ n = 1,\dots,N,$$
where $\text{margin}(h) = \min_{n=1,\dots,N} \frac{1}{\|w\|}\, y_n(w^\top x_n + b)$.
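The margin formula rests on the point-to-hyperplane distance; for reference, the standard step (not spelled out on the original slide) is

$$\text{distance}(x, h) = \frac{|w^\top x + b|}{\|w\|},$$

and when $(x_n, y_n)$ is correctly classified, $|w^\top x_n + b| = y_n(w^\top x_n + b)$, which gives the expression for $\text{margin}(h)$ above.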
The modified problem of the classifier $h: w^\top x + b = 0$:
$$\max_{b,w}\ \text{margin}(h) \quad \text{subject to } y_n(w^\top x_n + b) > 0,\ n = 1,\dots,N,$$
where $\text{margin}(h) = \min_{n=1,\dots,N} \frac{1}{\|w\|}\, y_n(w^\top x_n + b)$.

Special scaling, fixing the minimal functional distance at 1:
$$\max_{b,w}\ \frac{1}{\|w\|} \quad \text{subject to } \min_{n=1,\dots,N} y_n(w^\top x_n + b) = 1.$$
Why can we scale the minimal distance to an arbitrary value?
1. Suppose $\min_{n=1,\dots,N} y_n(w^\top x_n + b) = c$ for some $c > 0$. The margin is then $c/\|w\|$.
2. The rescaled pair $(b/c, w/c)$ describes the same classifier $h$ and attains a minimal value of 1, so maximizing the margin is the same as maximizing $1/\|w\|$ over the scaled pairs.
3. The optimization problem is
$$\max_{b,w}\ \frac{1}{\|w\|} \quad \text{subject to } \min_{n=1,\dots,N} y_n(w^\top x_n + b) = 1,$$
and comparing with the previous one: any candidate $(\tilde b, \tilde w)$ there corresponds to $(b, w) = (\tilde b/c, \tilde w/c)$ here, and obviously $1/\|w\| = c/\|\tilde w\|$, the margin of $(\tilde b, \tilde w)$.
The scaled problem of the classifier $h: w^\top x + b = 0$:
$$\max_{b,w}\ \frac{1}{\|w\|} \quad \text{subject to } \min_{n=1,\dots,N} y_n(w^\top x_n + b) = 1.$$
Relaxation of the constraints:
$$\min_{b,w}\ \frac12 w^\top w \quad \text{subject to } y_n(w^\top x_n + b) \ge 1,\ n = 1,\dots,N.$$
Why can we relax the constraints?
1. If the optimal $(b,w) \notin \{(b,w) : \min_{n=1,\dots,N} y_n(w^\top x_n + b) = 1\}$, then say $\min_{n=1,\dots,N} y_n(w^\top x_n + b) = c$ for some $c > 1$.
2. As we previously mentioned, there is another equivalent feasible pair: $(\tilde b, \tilde w) = (b/c, w/c)$.
3. The resulting margin is $1/\|\tilde w\| = c/\|w\| > 1/\|w\|$, which contradicts the optimality of $(b, w)$.
Hence the optimal $(b,w) \in \{(b,w) : \min_{n=1,\dots,N} y_n(w^\top x_n + b) = 1\}$.
The relaxed problem:
$$\min_{b,w}\ \frac12 w^\top w \quad \text{subject to } y_n(w^\top x_n + b) \ge 1,\ n = 1,\dots,N.$$
Solution from the process of quadratic programming (QP), whose standard form is
$$\min_u\ \frac12 u^\top Q u + p^\top u \quad \text{subject to } a_m^\top u \ge c_m,\ m = 1,\dots,M.$$
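For concreteness, one consistent identification of the relaxed problem with this QP standard form (the block notation is ours, not from the slide) is

$$u = \begin{pmatrix} b \\ w \end{pmatrix} \in \mathbb{R}^{d+1}, \quad Q = \begin{pmatrix} 0 & \mathbf{0}^\top \\ \mathbf{0} & I_d \end{pmatrix}, \quad p = \mathbf{0}, \quad a_n = y_n \begin{pmatrix} 1 \\ x_n \end{pmatrix}, \quad c_n = 1, \quad M = N.$$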
The linear SVM:
$$\min_{b,w}\ \frac12 w^\top w \quad \text{subject to } y_n(w^\top x_n + b) \ge 1,\ n = 1,\dots,N.$$
Solution from the method of Lagrange multipliers:
$$\mathcal{L}(b, w, \alpha) = \frac12 w^\top w + \sum_{n=1}^N \alpha_n \bigl(1 - y_n(w^\top x_n + b)\bigr),$$
where
$$\text{SVM} = \min_{b,w} \Bigl( \max_{\alpha_n \ge 0} \mathcal{L}(b, w, \alpha) \Bigr).$$
Why does
$$\text{SVM} = \min_{b,w} \Bigl( \max_{\alpha_n \ge 0} \Bigl( \frac12 w^\top w + \sum_{n=1}^N \alpha_n \bigl(1 - y_n(w^\top x_n + b)\bigr) \Bigr) \Bigr)?$$
1. Can $y_n(w^\top x_n + b) < 1$ happen? If yes, the inner maximum is $+\infty$ (let $\alpha_n \to \infty$), so such $(b,w)$ cannot be a solution of the SVM.
2. If $y_n(w^\top x_n + b) > 1$, then $\alpha_n = 0$ at the inner maximum.
3. Hence $\alpha_n > 0$ can only happen when $y_n(w^\top x_n + b) = 1$ (complementary slackness).
For
$$\mathcal{L}(b,w,\alpha) = \frac12 w^\top w + \sum_{n=1}^N \alpha_n \bigl(1 - y_n(w^\top x_n + b)\bigr),$$
the Lagrange dual problem (weak duality) is
$$\text{SVM} = \min_{b,w} \Bigl( \max_{\alpha_n \ge 0} \mathcal{L}(b,w,\alpha) \Bigr) \ \ge\ \max_{\alpha_n \ge 0} \Bigl( \min_{b,w} \mathcal{L}(b,w,\alpha) \Bigr).$$
The equality holds under strong duality, since the problem is:
1. convex;
2. feasible (existence of a solution, given separable data);
3. linearly constrained.
Simplify the dual problem: under the strong duality,
$$\text{SVM} = \max_{\alpha_n \ge 0}\ \min_{b,w} \Bigl( \frac12 w^\top w + \sum_{n=1}^N \alpha_n \bigl(1 - y_n(w^\top x_n + b)\bigr) \Bigr)$$
$$= \max_{\alpha_n \ge 0,\ \sum_n y_n\alpha_n = 0}\ \min_{w} \Bigl( \frac12 w^\top w + \sum_{n=1}^N \alpha_n \bigl(1 - y_n w^\top x_n\bigr) \Bigr),$$
since
$$\frac{\partial \mathcal{L}(b,w,\alpha)}{\partial b} = -\sum_{n=1}^N \alpha_n y_n = 0.$$
Further simplify the dual problem: under the strong duality and by
$$\frac{\partial \mathcal{L}}{\partial w_i} = w_i - \sum_{n=1}^N \alpha_n y_n x_{n,i} = 0 \ \Longrightarrow\ w = \sum_{n=1}^N \alpha_n y_n x_n,$$
we have
$$\text{SVM} = \max_{\substack{\alpha_n \ge 0,\ \sum_n y_n\alpha_n = 0,\\ w = \sum_n \alpha_n y_n x_n}} \Bigl( \frac12 w^\top w + \sum_{n=1}^N \alpha_n - w^\top w \Bigr) = \max_{\substack{\alpha_n \ge 0,\ \sum_n y_n\alpha_n = 0,\\ w = \sum_n \alpha_n y_n x_n}} \Bigl( -\frac12 \Bigl\| \sum_{n=1}^N \alpha_n y_n x_n \Bigr\|^2 + \sum_{n=1}^N \alpha_n \Bigr).$$
Karush-Kuhn-Tucker (KKT) conditions to solve $b$ and $w$ from the optimal $\alpha$:
1. Primal feasibility: $y_n(w^\top x_n + b) \ge 1$;
2. Dual feasibility: $\alpha_n \ge 0$;
3. Dual-inner optimality: $\sum_n y_n\alpha_n = 0$ and $w = \sum_n \alpha_n y_n x_n$;
4. Primal-inner optimality (complementary slackness): $\alpha_n\bigl(1 - y_n(w^\top x_n + b)\bigr) = 0$.
The dual problem:
$$\max_{\substack{\alpha_n \ge 0,\ \sum_n y_n\alpha_n = 0,\\ w = \sum_n \alpha_n y_n x_n}} \Bigl( -\frac12 \Bigl\|\sum_{n=1}^N \alpha_n y_n x_n\Bigr\|^2 + \sum_{n=1}^N \alpha_n \Bigr).$$
The equivalent standard dual SVM problem:
$$\min_\alpha\ \frac12 \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^\top x_m - \sum_{n=1}^N \alpha_n \quad \text{subject to } \sum_{n=1}^N y_n \alpha_n = 0;\ \alpha_n \ge 0,\ n = 1,\dots,N,$$
which has $N$ variables and $N + 1$ constraints, and is a convex quadratic programming problem.
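As an illustration, this standard dual can be handed to an off-the-shelf QP solver. Below is a minimal R sketch using the quadprog package on hypothetical separable toy data; the data generation, the names X and y, and the small ridge added for positive definiteness (solve.QP requires a positive-definite matrix) are all our own choices, not from the slides.

library(quadprog)

set.seed(1)
N <- 20                                    # hypothetical toy sample size
X <- rbind(matrix(rnorm(N, mean =  3), ncol = 2),
           matrix(rnorm(N, mean = -3), ncol = 2))
y <- c(rep(1, N / 2), rep(-1, N / 2))

Q  <- (y %*% t(y)) * (X %*% t(X))          # q_{n,m} = y_n y_m x_n' x_m
D  <- Q + 1e-8 * diag(N)                   # ridge for positive definiteness
d  <- rep(1, N)                            # linear term: maximize sum(alpha)
A  <- cbind(y, diag(N))                    # first column: equality y' alpha = 0
b0 <- rep(0, N + 1)                        # remaining columns: alpha_n >= 0
sol   <- solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0, meq = 1)
alpha <- sol$solution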
By the KKT conditions, we have $w = \sum_{n=1}^N \alpha_n y_n x_n$, and from the complementary slackness condition
$$\alpha_n\bigl(1 - y_n(w^\top x_n + b)\bigr) = 0,\ n = 1,\dots,N,$$
$$b = y_s - w^\top x_s, \quad \text{with } \alpha_s > 0.$$
Or, letting $I = \{s = 1,\dots,N : \alpha_s > 0\}$,
$$b = \frac{1}{|I|} \sum_{s \in I} \bigl(y_s - w^\top x_s\bigr),$$
where $I$ is the index set of the support vectors: $y_s(w^\top x_s + b) = 1$ for $s \in I$.
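Continuing the sketch above, $w$ and $b$ can be recovered from the dual solution; the tolerance eps used to detect $\alpha_s > 0$ numerically is an arbitrary choice of ours.

eps <- 1e-6
w   <- colSums(alpha * y * X)                       # w = sum_n alpha_n y_n x_n
sv  <- which(alpha > eps)                           # support vectors
b   <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)    # average of y_s - w' x_s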
Primal linear SVM (suitable for small $d$): $d + 1$ variables, $N$ constraints:
$$\min_{b,w}\ \frac12 w^\top w \quad \text{subject to } y_n(w^\top x_n + b) \ge 1,\ n = 1,\dots,N.$$
Dual SVM (suitable for small $N$): $N$ variables, $N + 1$ constraints:
$$\min_\alpha\ \frac12 \alpha^\top Q \alpha - \mathbf{1}^\top \alpha, \quad Q = \{q_{n,m}\},\ q_{n,m} = y_n y_m x_n^\top x_m,$$
$$\text{subject to } y^\top \alpha = 0;\ \alpha_n \ge 0,\ n = 1,\dots,N.$$
Review of Linear and Kernel SVM
How about a non-linear boundary?

[Figure: toy data sets with linear and non-linear boundaries, axes $x_1$ and $x_2$]
1. Let $\Phi : \mathbb{R}^d \to \mathbb{R}^{\tilde d}$ and $z_n = \Phi(x_n)$, $n = 1,\dots,N$.
2. Kernel function: $K_\Phi(x_n, x_m) = \Phi(x_n)^\top \Phi(x_m)$.
3. The dual problem under the transform $\Phi$:
$$\min_\alpha\ \frac12 \alpha^\top Q_\Phi \alpha - \mathbf{1}^\top\alpha \quad \text{subject to } y^\top\alpha = 0;\ \alpha_n \ge 0,\ n = 1,\dots,N,$$
where $Q_\Phi = \{q_{n,m}\}$, $q_{n,m} = y_n y_m K_\Phi(x_n, x_m)$.
Polynomial kernel function:
$$K_q(x, x') = (\zeta + \gamma x^\top x')^q; \quad \gamma > 0,\ \zeta \ge 0,$$
which is commonly applied for polynomial SVM.
Gaussian kernel function:
$$K(x, x') = \exp(-\gamma \|x - x'\|^2), \quad \gamma > 0,$$
which is a kernel of an infinite-dimensional transform; for $d = 1$ (i.e. $x = x$ scalar) and taking $\gamma = 1$,
$$\Phi(x) = \exp(-x^2)\Bigl(1,\ \sqrt{\tfrac{2}{1!}}\,x,\ \sqrt{\tfrac{2^2}{2!}}\,x^2,\ \dots \Bigr).$$
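A minimal R sketch of these two kernels; the function names and the default hyperparameters (gamma, zeta, q) are illustrative choices of ours.

poly_kernel <- function(x1, x2, zeta = 1, gamma = 1, q = 2) {
  (zeta + gamma * sum(x1 * x2))^q          # (zeta + gamma x'x')^q
}
gauss_kernel <- function(x1, x2, gamma = 1) {
  exp(-gamma * sum((x1 - x2)^2))           # exp(-gamma ||x - x'||^2)
}
# kernel matrix over the rows of an N x d data matrix X
kernel_matrix <- function(X, kern, ...) {
  N <- nrow(X)
  K <- matrix(0, N, N)
  for (n in 1:N) for (m in 1:N) K[n, m] <- kern(X[n, ], X[m, ], ...)
  K
}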
The classifier of kernel SVM, $g_{\text{SVM}}(x) = \operatorname{sign}(w^\top \Phi(x) + b)$:
$$b = y_s - w^\top z_s = y_s - \Bigl( \sum_{n=1}^N \alpha_n y_n z_n \Bigr)^{\!\top} z_s = y_s - \sum_{n=1}^N \alpha_n y_n K(x_n, x_s),$$
$$w^\top \Phi(x) = \sum_{n=1}^N \alpha_n y_n z_n^\top z = \sum_{n=1}^N \alpha_n y_n K(x_n, x).$$
We never need to know $w$ explicitly, and the hyperplane classifier can remain a mystery hidden in the kernel.
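A minimal sketch of kernel-SVM prediction without ever forming $w$; it assumes alpha and b were obtained from a dual solved with the same kernel kern, and reuses the hypothetical kernel functions sketched above.

ksvm_predict <- function(x_new, X, y, alpha, b, kern = gauss_kernel, ...) {
  # g(x) = sign( sum_n alpha_n y_n K(x_n, x) + b )
  s <- sum(alpha * y * apply(X, 1, function(xn) kern(xn, x_new, ...)))
  sign(s + b)
}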
Hard-Margin SVM

[Figure: hard-margin SVM decision boundaries on toy data sets, axes $x_1$ and $x_2$]
Soft-Margin SVM

[Figure: soft-margin SVM decision boundaries on the same toy data sets, axes $x_1$ and $x_2$]
Soft-Margin SVM
1. If we always insist on separating the data, the SVM tends to overfit to noise.
2. Hard-margin SVM:
$$\min_{b,w}\ \frac12 w^\top w \quad \text{subject to } y_n(w^\top x_n + b) \ge 1.$$
3. Tolerate the noise by allowing data points to be incorrectly classified.
4. Record the margin violation of each point by $\xi_n \ge 0$, $n = 1,\dots,N$.
5. Soft-margin SVM:
$$\min_{b,w,\xi}\ \frac12 w^\top w + C\sum_{n=1}^N \xi_n \quad \text{subject to } y_n(w^\top x_n + b) \ge 1 - \xi_n,\ \xi_n \ge 0,\ n = 1,\dots,N.$$
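For intuition, note that at the optimum $\xi_n = \max\bigl(0,\, 1 - y_n(w^\top x_n + b)\bigr)$, so the program above can equivalently be written as an unconstrained hinge-loss problem (a standard reformulation, not shown on the original slides):

$$\min_{b,w}\ \frac12 w^\top w + C \sum_{n=1}^N \max\bigl(0,\ 1 - y_n(w^\top x_n + b)\bigr).$$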
Soft-Margin Dual SVM
Quadratic programming with $d + 1 + N$ variables and $2N$ constraints:
$$\min_{b,w,\xi}\ \frac12 w^\top w + C\sum_{n=1}^N \xi_n \quad \text{subject to } y_n(w^\top z_n + b) \ge 1 - \xi_n,\ \xi_n \ge 0,\ n = 1,\dots,N.$$
The dual problem: $\text{SVM} = \min_{b,w,\xi} \bigl( \max_{\alpha_n \ge 0,\, \beta_n \ge 0} \mathcal{L}(b,w,\xi,\alpha,\beta) \bigr)$, where
$$\mathcal{L}(b,w,\xi,\alpha,\beta) = \frac12 w^\top w + C\sum_{n=1}^N \xi_n + \sum_{n=1}^N \beta_n(-\xi_n) + \sum_{n=1}^N \alpha_n\bigl(1 - \xi_n - y_n(w^\top z_n + b)\bigr).$$
Soft-Margin Dual SVM
Simplify the soft-margin dual problem: under the strong duality,
$$\text{SVM} = \max_{\alpha_n \ge 0,\ \beta_n \ge 0}\ \min_{b,w,\xi} \Bigl( \frac12 w^\top w + C\sum_{n=1}^N \xi_n + \sum_{n=1}^N \beta_n(-\xi_n) + \sum_{n=1}^N \alpha_n\bigl(1 - \xi_n - y_n(w^\top z_n + b)\bigr) \Bigr)$$
$$= \max_{0 \le \alpha_n \le C,\ \beta_n = C - \alpha_n}\ \min_{b,w} \Bigl( \frac12 w^\top w + \sum_{n=1}^N \alpha_n\bigl(1 - y_n(w^\top z_n + b)\bigr) \Bigr),$$
since
$$\frac{\partial \mathcal{L}}{\partial \xi_n} = C - \alpha_n - \beta_n = 0 \ \Longrightarrow\ \beta_n = C - \alpha_n \quad (\beta_n \ge 0 \text{ forces } \alpha_n \le C).$$
Soft-Margin Dual SVM
Further simplify the soft-margin dual problem: under the strong duality and by
$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{n=1}^N \alpha_n y_n = 0, \qquad \frac{\partial \mathcal{L}}{\partial w_i} = 0 \ \Longrightarrow\ w = \sum_{n=1}^N \alpha_n y_n z_n,$$
we have
$$\text{SVM} = \max_{\substack{0 \le \alpha_n \le C,\ \beta_n = C - \alpha_n,\\ \sum_n y_n \alpha_n = 0,\ w = \sum_n \alpha_n y_n z_n}} \Bigl\{ -\frac12 \Bigl\| \sum_{n=1}^N \alpha_n y_n z_n \Bigr\|^2 + \sum_{n=1}^N \alpha_n \Bigr\}.$$
Soft-Margin Dual SVM
The soft-margin dual SVM:
$$\text{SVM} = \max_{\substack{0 \le \alpha_n \le C,\ \beta_n = C - \alpha_n,\\ \sum_n y_n\alpha_n = 0,\ w = \sum_n \alpha_n y_n z_n}} \Bigl\{ -\frac12 \Bigl\|\sum_{n=1}^N \alpha_n y_n z_n\Bigr\|^2 + \sum_{n=1}^N \alpha_n \Bigr\}.$$
A QP with $N$ variables and $2N + 1$ constraints:
$$\min_\alpha\ \frac12 \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m z_n^\top z_m - \sum_{n=1}^N \alpha_n \quad \text{subject to } \sum_{n=1}^N y_n \alpha_n = 0;\ 0 \le \alpha_n \le C,\ n = 1,\dots,N,$$
implicitly $w = \sum_{n=1}^N \alpha_n y_n z_n$ and $\beta_n = C - \alpha_n$, $n = 1,\dots,N$.
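A minimal sketch of this box-constrained dual, reusing the hypothetical X, y, N, and Q from the hard-margin quadprog sketch; the penalty C = 1 is an arbitrary illustrative value.

C  <- 1
D  <- Q + 1e-8 * diag(N)                     # ridge for positive definiteness
d  <- rep(1, N)
A  <- cbind(y, diag(N), -diag(N))            # y'alpha = 0; alpha >= 0; -alpha >= -C
b0 <- c(0, rep(0, N), rep(-C, N))
sol   <- solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0, meq = 1)
alpha <- pmin(pmax(sol$solution, 0), C)      # clip tiny numerical violations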
Soft-Margin Kernel SVM
A QP with $N$ variables and $2N + 1$ constraints:
$$\min_\alpha\ \frac12 \sum_{n=1}^N\sum_{m=1}^N \alpha_n\alpha_m y_n y_m K_\Phi(x_n, x_m) - \sum_{n=1}^N \alpha_n \quad \text{subject to } \sum_{n=1}^N y_n\alpha_n = 0;\ 0 \le \alpha_n \le C,\ n = 1,\dots,N,$$
implicitly $w = \sum_{n=1}^N \alpha_n y_n z_n$ and $\beta_n = C - \alpha_n$, $n = 1,\dots,N$.
The hypothesis hyperplane of the soft-margin kernel SVM:
$$g_{\text{SVM}}(x) = \operatorname{sign}\Bigl( \sum_{n=1}^N \alpha_n y_n K_\Phi(x_n, x) + b \Bigr),$$
where $b$ can be solved for by the complementary slackness.
Soft-Margin Kernel SVM
The hypothesis hyperplane:
$$g_{\text{SVM}}(x) = \operatorname{sign}\Bigl( \sum_{n=1}^N \alpha_n y_n K_\Phi(x_n, x) + b \Bigr).$$
The complementary slackness conditions:
$$\alpha_n\bigl(1 - \xi_n - y_n(w^\top z_n + b)\bigr) = 0, \qquad (C - \alpha_n)\xi_n = 0.$$
For support vectors (with $\alpha_s > 0$): $b = y_s - y_s\xi_s - w^\top z_s$.
For free support vectors (with $0 < \alpha_s < C$): $\xi_s = 0$, so $b = y_s - w^\top z_s$.
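Continuing the soft-margin sketch above, $b$ can be recovered from the free support vectors; the linear kernel matrix matches the Q used there, and for a general kernel one would substitute, e.g., kernel_matrix(X, gauss_kernel) from the earlier sketch.

eps  <- 1e-6
free <- which(alpha > eps & alpha < C - eps)         # free SVs: 0 < alpha_s < C
Kmat <- X %*% t(X)                                   # linear-kernel matrix K(x_n, x_m)
b    <- mean(y[free] - Kmat[free, , drop = FALSE] %*% (alpha * y))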
Physical Meanings of $\alpha_n$
Complementary slackness:
$$\alpha_n\bigl(1 - \xi_n - y_n(w^\top z_n + b)\bigr) = 0, \qquad (C - \alpha_n)\xi_n = 0.$$
1. Non-SV ($\alpha_n = 0$): $\xi_n = 0$;
2. Free SV ($0 < \alpha_n < C$): $\xi_n = 0$;
3. Bounded SV ($\alpha_n = C$): $\xi_n = 1 - y_n(w^\top z_n + b)$.
The $\alpha_n$ can be used for data analysis.
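A one-liner making this classification concrete, reusing the hypothetical alpha, C, and eps from the sketches above.

role <- ifelse(alpha <= eps, "non-SV",
        ifelse(alpha >= C - eps, "bounded SV", "free SV"))
table(role)   # counts of non-SVs, free SVs, and bounded SVs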
Selection of Penalty C

[Figure: soft-margin SVM decision boundaries for several values of $C$, axes $x_1$ and $x_2$]
R Code of the Toy Example

library(kernlab)
set.seed(5)
# data generation; the numeric constants lost in extraction (sample size,
# standard deviations, the third cluster's mean, sigma, C) are reconstructed guesses
n <- 100
x1 <- c(rnorm(n, 3, 1), rnorm(n, -3, 1), rnorm(n, 0, 1))
x2 <- c(rnorm(n, 3, 1), rnorm(n, -3, 1), rnorm(n, 0, 1))
y  <- factor(c(rep(TRUE, n), rep(FALSE, n), rep(c(TRUE, FALSE), n / 2)))
data <- data.frame(y = y, x1 = x1, x2 = x2)
# fit the soft-margin Gaussian kernel SVM
model.ksvm <- ksvm(y ~ x1 + x2, data = data, kernel = "rbfdot",
                   kpar = list(sigma = 1), C = 1)
plot(model.ksvm, data = data)
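After the fit, predictions on new points use kernlab's predict() method on the ksvm object; the small grid below is our own illustrative input, assuming the reconstructed objects above.

newx <- data.frame(x1 = c(0, 3, -3), x2 = c(0, 3, -3))
predict(model.ksvm, newdata = newx)   # predicted class labels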
References
Special thanks to Prof. Hsuan-Tien Lin.
Handout slides and YouTube videos: https://www.csie.ntu.edu.tw/~htlin/mooc/