Support Vector Machine Kernel: A kernel is defined as a function returning the inner product between the images of its two arguments,
$$k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle,$$
so that $k(x_1, x_2) = k(x_2, x_1)$. Modularity: it is possible to construct new kernels from old ones. If $k_1, k_2$ are kernels, then
- $k_1 + k_2$ is a kernel
- $c\,k_1$ is a kernel for $c > 0$
- $a\,k_1 + b\,k_2$ is a kernel for $a, b > 0$
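To make the closure properties concrete, the following sketch (a minimal illustration, assuming NumPy and arbitrarily chosen sample points) builds Gram matrices for two base kernels and checks numerically that $k_1 + k_2$ and $c\,k_1$ still yield symmetric positive semi-definite matrices, i.e. valid kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))             # a few arbitrary sample points

def k1(x, y):                           # linear kernel
    return x @ y

def k2(x, y):                           # Gaussian kernel with sigma = 1 (assumed)
    return np.exp(-np.sum((x - y) ** 2) / 2.0)

def gram(k, X):
    return np.array([[k(a, b) for b in X] for a in X])

# k1 + k2 and c*k1 (c > 0) should again be kernels: their Gram matrices
# remain symmetric positive semi-definite (no negative eigenvalues).
for K in (gram(k1, X) + gram(k2, X), 3.0 * gram(k1, X)):
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)
```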
Examples:
- Polynomial kernel: $k(x, y) = \langle x, y \rangle^n$
- Gaussian kernel: $k(x, y) = e^{-\|x - y\|^2/(2\sigma^2)}$

Let $x = (x_1, x_2)^T$, $y = (y_1, y_2)^T$. Then
$$\langle x, y \rangle^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 x_2 y_2 = \langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2), (y_1^2, y_2^2, \sqrt{2}\, y_1 y_2) \rangle = \langle \phi(x), \phi(y) \rangle. \qquad (1)$$

Given a set of vectors $x_1, \ldots, x_N$, the kernel matrix is defined as
$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{bmatrix}.$$
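As a quick numerical check of (1), the sketch below (assuming NumPy; the points are arbitrary) verifies that the polynomial kernel $\langle x, y \rangle^2$ equals the inner product of the explicit feature vectors, and builds a small kernel matrix.

```python
import numpy as np

def k_poly2(x, y):
    return (x @ y) ** 2                          # k(x, y) = <x, y>^2

def phi(x):
    # explicit feature map for the 2-D case in (1)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.5, -0.7])
y = np.array([0.3, 2.0])
print(k_poly2(x, y), phi(x) @ phi(y))            # the two values agree

# kernel matrix for a set of vectors x_1, ..., x_N
X = np.array([[1.0, 2.0], [0.0, -1.0], [2.0, 0.5]])
K = np.array([[k_poly2(a, b) for b in X] for a in X])
print(K)
```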
Properties:
- The kernel matrix is positive semi-definite, i.e. for any vector $a$ it satisfies $a^T K a \ge 0$.
- Any symmetric (i.e. $a_{ij} = a_{ji}$) positive semi-definite matrix can be regarded as a kernel matrix (an inner product matrix in some space).
- Every positive semi-definite, symmetric function $k$ is a kernel, i.e. there exists a mapping $\phi$ such that $k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$.
- The kernel can be expanded in the series
$$k(x_1, x_2) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x_1) \phi_i(x_2), \qquad \lambda_i \ge 0 \ \forall i,$$
where the $\lambda_i$ are nonnegative eigenvalues.
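The sketch below (a finite-sample illustration with a Gaussian kernel on arbitrary points, assuming NumPy) checks the positive semi-definiteness of a kernel matrix via its eigenvalues and then reconstructs it from its eigendecomposition, the matrix analogue of the series expansion above.

```python
import numpy as np

# kernel matrix of a Gaussian kernel (sigma = 1 assumed) on a few arbitrary points
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

# symmetric positive semi-definite: all eigenvalues are nonnegative
lam, U = np.linalg.eigh(K)
print(np.allclose(K, K.T), lam.min() >= -1e-10)

# finite-sample analogue of k(x_i, x_j) = sum_l lambda_l phi_l(x_i) phi_l(x_j):
# K is recovered from its nonnegative eigenvalues and eigenvectors
print(np.allclose(K, (U * lam) @ U.T))
```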
Cover's Theorem: Given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some non-linear transformation.

[Figure: the data shown in the original space $(x_1, x_2)$ and in the feature space $(\phi_1, \phi_2, \phi_3)$.]

The data is not linearly separable in the original space $[x_1, x_2]$. Select $\phi_1 = x_1$, $\phi_2 = x_2$, $\phi_3 = x_1^2 + x_2^2$. The data becomes linearly separable in the new, higher-dimensional feature space.
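A small demonstration of Cover's theorem with the feature map above (a sketch assuming NumPy; the two ring-shaped classes are made up for illustration): in $[\phi_1, \phi_2, \phi_3]$ a plane thresholding $\phi_3 = x_1^2 + x_2^2$ separates the classes exactly.

```python
import numpy as np

# two classes that are not linearly separable in [x1, x2]:
# class -1 inside a circle of radius 1, class +1 in a ring of radii 2 to 3
rng = np.random.default_rng(2)
r = np.concatenate([rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)])
theta = rng.uniform(0.0, 2 * np.pi, 100)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
t = np.concatenate([-np.ones(50), np.ones(50)])

# map to the feature space (phi1, phi2, phi3) = (x1, x2, x1^2 + x2^2)
Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

# the plane phi3 = 1.5^2 now separates the two classes perfectly
pred = np.where(Phi[:, 2] > 1.5 ** 2, 1, -1)
print((pred == t).all())                         # True
```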
Maximum Margin Classifiers: Consider the two-class classification problem using linear models of the form
$$y(x) = w^T \phi(x) + b \qquad (2)$$
where $\phi(x)$ denotes a fixed feature-space transformation. The training data set comprises $N$ input vectors $x_1, \ldots, x_N$ with corresponding target values $t_1, \ldots, t_N$, where $t_n \in \{-1, 1\}$, and a new data point $x$ is classified according to the sign of $y(x)$.

Non-overlapping class distributions: We start with this case. The support vector machine approaches the problem by maximizing the margin, which is defined to be the smallest distance between the decision boundary and any of the samples.
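As a concrete reading of this definition, the margin of a given linear classifier is the smallest distance of any training point to the decision boundary, using the point-to-hyperplane distance $t_n y(x_n)/\|w\|$ discussed below. A minimal sketch (hypothetical weights and data, identity feature map assumed):

```python
import numpy as np

# hypothetical, linearly separable toy data and a candidate classifier y(x) = w^T x + b
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -3.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])
b = 0.0

# signed distance of each correctly classified point to the boundary y(x) = 0
dist = t * (X @ w + b) / np.linalg.norm(w)
print("margin =", dist.min())                    # smallest distance over the samples
```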
[Figure: a linear decision surface $g(x) = 0$ separating the regions $g > 0$ and $g < 0$, with the weight vector $w$ and the distance from a point $x$ to the surface.]

Since a data point's distance to the hyperplane is given by $|y(x)|/\|w\|$, in the separable case we can scale $w$ and $b$ so that $y(x) = 1$ for the nearest point, and the margin is then $1/\|w\|$. In the case that the training data set is linearly separable in the feature space, for the linear classifier
$$y(x) = w^T \phi(x) + b \qquad (3)$$
we have $t_n y(x_n) > 0$ for all data points. We can set $t_n y(x_n) = 1$ for the point that is closest to the surface. All data points then satisfy the constraints
$$t_n \left( w^T \phi(x_n) + b \right) \ge 1, \qquad n = 1, \ldots, N. \qquad (4)$$
We have to solve the optimization problem
$$\arg\min_{w, b} \ \frac{1}{2} \|w\|^2 \qquad (5)$$
subject to (4), which is an example of a quadratic programming (QP) problem. Introducing Lagrange multipliers $a_n \ge 0$, one for each of the constraints in (4), gives the Lagrangian function
$$L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n \left( w^T \phi(x_n) + b \right) - 1 \right\} \qquad (6)$$
where $a = [a_1, \ldots, a_N]^T$.
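Before continuing with the dual derivation, here is a sketch of solving the primal problem (5) subject to (4) directly with a generic constrained optimizer (assuming SciPy and a made-up separable dataset with the identity feature map); this only illustrates the QP, it is not how SVM solvers are implemented in practice.

```python
import numpy as np
from scipy.optimize import minimize

# made-up linearly separable data, identity feature map phi(x) = x
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                       # (1/2)||w||^2, eq. (5)
    w = p[:2]
    return 0.5 * w @ w

def constraints(p):                     # t_n (w^T x_n + b) - 1 >= 0, eq. (4)
    w, b = p[:2], p[2]
    return t * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints}])
w_opt, b_opt = res.x[:2], res.x[2]
print("w =", w_opt, "b =", b_opt, "margin =", 1.0 / np.linalg.norm(w_opt))
```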
Setting the derivatives of $L(w, b, a)$ with respect to $w$ and $b$ equal to zero, we obtain the following two conditions
$$w = \sum_{n=1}^{N} a_n t_n \phi(x_n) \qquad (7)$$
$$0 = \sum_{n=1}^{N} a_n t_n. \qquad (8)$$
Substituting these into (3), we have
$$y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b \qquad (9)$$
where $k(x, x_n)$ is the kernel function. Eliminating $w$ and $b$ from $L(w, b, a)$ using these conditions then gives the dual representation of the maximum margin problem
$$L(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m) \qquad (10)$$
subject to the constraints
$$a_n \ge 0, \qquad n = 1, \ldots, N, \qquad (11)$$
$$0 = \sum_{n=1}^{N} a_n t_n, \qquad (12)$$
which is also a quadratic programming problem. Applying the KKT conditions,
$$a_n \ge 0$$
$$t_n y(x_n) - 1 \ge 0$$
$$a_n \{ t_n y(x_n) - 1 \} = 0.$$
Thus for every data point either $a_n = 0$ or $t_n y(x_n) = 1$. Any data point with $a_n = 0$ will not appear in (9). The remaining data points are called support vectors; they satisfy $t_n y(x_n) = 1$ and correspond to points that lie on the maximum margin hyperplane in feature space, i.e.
$$y(x) = \sum_{n \in S} a_n t_n k(x, x_n) + b \qquad (13)$$
where $S$ denotes the set of indices of the support vectors. Having solved the QP problem and found $a$, we make use of $t_n^2 = 1$ to solve for $b$ as
$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right) \qquad (14)$$
where $N_S$ is the total number of support vectors.

Example: XOR problem.

Input vector $x_i$   Desired response $t_i$
(-1, -1)             -1
(-1, 1)               1
(1, -1)               1
(1, 1)                -1
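Before working through the algebra by hand, a library SVM with the same kernel can serve as a sanity check (a sketch assuming scikit-learn is available; a very large $C$ approximates the hard-margin problem). Its support vectors, dual coefficients $a_n t_n$, and bias should match the values derived below.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1])

# kernel (1 + x^T x_i)^2; large C approximates the hard-margin SVM
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6)
clf.fit(X, t)

print(clf.support_)      # indices of the support vectors (all four points)
print(clf.dual_coef_)    # a_n t_n, expected to be +/- 1/8
print(clf.intercept_)    # b, expected to be 0
print(clf.predict(X))    # reproduces the XOR labels
```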
Let $k(x, x_i) = (1 + x^T x_i)^2$, with $x = [x_1, x_2]^T$ and $x_i = [x_{i,1}, x_{i,2}]^T$. Then
$$k(x, x_i) = 1 + x_1^2 x_{i,1}^2 + 2 x_1 x_{i,1} x_2 x_{i,2} + x_2^2 x_{i,2}^2 + 2 x_1 x_{i,1} + 2 x_2 x_{i,2} = \phi(x)^T \phi(x_i) \qquad (15)$$
with $\phi(x) = [1, x_1^2, \sqrt{2}\, x_1 x_2, x_2^2, \sqrt{2}\, x_1, \sqrt{2}\, x_2]^T$ and $\phi(x_i) = [1, x_{i,1}^2, \sqrt{2}\, x_{i,1} x_{i,2}, x_{i,2}^2, \sqrt{2}\, x_{i,1}, \sqrt{2}\, x_{i,2}]^T$ for all $i$. The kernel matrix is
$$K = \begin{bmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{bmatrix}.$$
The objective function in dual form is
$$L(a) = \sum_{n=1}^{4} a_n - \frac{1}{2} \sum_{n=1}^{4} \sum_{m=1}^{4} a_n a_m t_n t_m k(x_n, x_m)$$
$$= a_1 + a_2 + a_3 + a_4 - \frac{1}{2} \left( 9 a_1^2 - 2 a_1 a_2 - 2 a_1 a_3 + 2 a_1 a_4 + 9 a_2^2 + 2 a_2 a_3 - 2 a_2 a_4 + 9 a_3^2 - 2 a_3 a_4 + 9 a_4^2 \right) \qquad (16)$$
Setting the derivatives of $L(a)$ to zero, we have
$$9 a_1 - a_2 - a_3 + a_4 = 1$$
$$-a_1 + 9 a_2 + a_3 - a_4 = 1$$
$$-a_1 + a_2 + 9 a_3 - a_4 = 1$$
$$a_1 - a_2 - a_3 + 9 a_4 = 1$$
giving $a_1 = a_2 = a_3 = a_4 = 1/8$. The result indicates that all four data samples are support vectors. The weight vector is
$$w = \sum_{n=1}^{4} a_n t_n \phi(x_n) = [0, 0, -1/\sqrt{2}, 0, 0, 0]^T. \qquad (17)$$
The bias term $b$ is
$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right) = \frac{1}{4} \sum_{n=1}^{4} \left( t_n - \frac{1}{8} \sum_{m=1}^{4} t_m k(x_n, x_m) \right) = \frac{1}{4} \big[ (-1 + 1) + (1 - 1) + (1 - 1) + (-1 + 1) \big] = 0. \qquad (18)$$
So the decision function is $y(x) = -x_1 x_2$, and the decision hyperplane is given by the polynomial $x_1 x_2 = 0$.

[Figure: the XOR decision surface $x_1 x_2 = 0$ over $[-1, 1] \times [-1, 1]$.]
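The whole worked example can be verified numerically. The sketch below (assuming NumPy) builds the kernel matrix, solves the stationarity system obtained from (16), and recovers $w$, $b$, and the decision function $y(x) = -x_1 x_2$.

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([-1.0, 1.0, 1.0, -1.0])

K = (1.0 + X @ X.T) ** 2                  # k(x_n, x_m) = (1 + x_n^T x_m)^2
print(K)                                  # the 9/1 kernel matrix above

# setting dL/da = 0 in (16) gives the linear system (t_n t_m K_nm) a = 1
H = (t[:, None] * t[None, :]) * K
a = np.linalg.solve(H, np.ones(4))
print(a)                                  # [1/8, 1/8, 1/8, 1/8]
print(a @ t)                              # constraint (12) is satisfied: 0

def phi(x):                               # feature map from (15)
    return np.array([1.0, x[0] ** 2, np.sqrt(2) * x[0] * x[1],
                     x[1] ** 2, np.sqrt(2) * x[0], np.sqrt(2) * x[1]])

w = sum(a[n] * t[n] * phi(X[n]) for n in range(4))
print(w)                                  # [0, 0, -1/sqrt(2), 0, 0, 0]

b = np.mean(t - (a * t) @ K)              # eq. (14), all points are support vectors
print(b)                                  # 0

y = np.array([w @ phi(x) + b for x in X])
print(np.sign(y) == t)                    # y(x) = -x1*x2 reproduces the XOR labels
```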