SVMs: nonlinearity through kernels

Size: px

Start display at page:

Download "SVMs: nonlinearity through kernels"

Charity Butler
5 years ago
Views:

1 Non-separable data e-8. Support Vector Machines 8.. The Optimal Hyperplane Consider the following two datasets: SVMs: nonlinearity through kernels ER Chapter 3.4, e-8 (a) Few noisy data. (b) Nonlinearly separable. Transform your features! 8.. The Optimal Hyperplane Consider the following two datasets: x x zn = Φ(xn ). minimize: b,w x to: subject x (b) Nonlinearly separable. PT A t w w "! yn w t zn + b (8.9) (n =,z =, xn ), z = x objects in where w is now in Rd instead of Rd (recall that we use tilde for The optimization problem in (8.9) is a QP-problem with d + Z space). Linear8.6: with outliers data (reproduction Nonlinear Figure Non-separable of Figure 3.). data is not linearly separable? Figure 8.6 (reproduced from Chapter 3) illustrates the two types of non-separability. In Figure 8.6(a), two noisy data points render the data non-separable. In Figure 8.6(b), the target function is inherently nonlinear. For the learning problem in Figure 8.6(a), we prefer the linear separator, and need to tolerate the few noisy data points. In Chapter 3, we modified the PLA into the pocket algorithm to handle this situation. Similarly, for SVMs, we will modify the hard-margin SVM to the soft-margin SVM in Section 8.4. Unlike the hard margin SVM, the soft-margin SVM allows data points to violate the cushion, or even be misclassified. To address the other situation in Figure 8.6(b), we introduced the nonlinear transform in Chapter 3. There is nothing to stop us from using the nonlinear After transforming the data, we solve the hard-margin SVM problem in the Z space, which is just (8.4) written with zn instead of xn : ER (a) Few noisy data. z x= x z = Non-separable data e-8. Support Vector Machines data is not linearly separable? Figure 8.6 (reproduced from Chapter 3) illustrates the two types of non-separability. In Figure 8.6(a), two noisy data points render the data non-separable. In Figure 8.6(b), the target function is inherently nonlinear. For the learning problem in Figure 8.6(a), we prefer the linear separator, and need to tolerate the few noisy data points. In Chapter 3, we modified the PLA into the pocket algorithm to handle this situation. Similarly, for SVMs, Mechanics of SVM the tofeature Transform I 8.4. we will modify the hard-margin the soft-margin SVM in Section Mechanics of the thesoft-margin FeatureSVM Transform I Unlike the hard margin SVM, allows data points to violate the the data cushion, even bein misclassified. Transform to aorz-space which the data is separable. To address the other situation in Figure 8.6(b), we introduced the nonlinear Transform the data to a Z-space in which the data is separable. 0 Map your data into usingtoa stop nonlinear transform in Chapter 3. Z-space There is nothing us fromfunction using the nonlinear 0 transform with the optimal hyperplane, which we will do here. To render the data separable, we would typically transform into a higher dimension. Consider a transform Φ : Rd Rd. The transformed data are e-c H A PT Both are not 8.6: linearly separable. But there isofafigure difference! Figure Non-separable data (reproduction 3.). x = x 3 = xx cxam Magdon-Ismail, Lin: Jan-05 L Abu-Mostafa, x z = Φ(x) = x = Φ(x) Φ(x) (x) z = Φ(x) = e-chap:8 9 xx = Φ x Φ(x) 4 c AM L Creator: Malik Magdon-Ismail Nonlinear Transforms: 6 /7 Feature transform II: classify in Z-space c AM L Creator: Malik Magdon-Ismail Nonlinear Transforms: 6 /7 Feature transform II: classify in Z-space

2 Mechanics of the Feature Transform III Classification in Z-space To classify a new x, firsttransformx to Φ(x) Z-space and classify there with g. Classification in Z-space In Z-space the data can be linearly separated: g(z) =sign( w t z) a g(x) = g(φ(x)) =sign( w t Φ(x)) g(z) =sign( w t z) c AM L Creator: Malik Magdon-Ismail Nonlinear Transforms: 8/7 Summary of nonlinear transform Must Choose Φ BEFORE Your Look at the Data 5 6 c AM L Creator: Malik Magdon-Ismail Nonlinear Transforms: 7/7 Feature transform III: bring back to X -space After constructing features carefully, before seeing the data... Classification in Z-space...if you think linear is not enough, try the nd order polynomial transform. x x = x Φ(x) = Φ (x) Φ (x) = Φ 3 (x) Φ 4 (x) Φ 5 (x) What can we say about the dimensionality of the Z-space as a function of the dimensionality of the data? x x x x x x c AM L Creator: Malik Magdon-Ismail Nonlinear Transforms: /7 The polynomial transform The General Polynomial Transform Φ k The polynomial Z-space We can We get can even choose fancier: higher degree-k order polynomial polynomials: transform: Φ (x) =(, x,x ), Φ (x) =(, x,x, x,x x,x ), Φ 3 (x) =(, x,x, x,x x,x, x 3,x x,x x,x 3 ), Φ 4 (x) =(, x,x, x,x x,x, x 3,x x,x x,x 3, x 4,x 3 x,x x,x x 3,x 4 ),. What are the potential effects of increasing the order of the Dimensionalityofthefeaturespaceincreasesrapidly(d polynomial? vc )! Similar transforms for d-dimensional original space. Approximation-generalizationtradeoff Higher degree gives lower (even zero) E in but worse generalization. 7 8 c AM L Creator: Malik Magdon-Ismail Nonlinear Transforms: 3/7 Be carefull with nonlinear transforms

The polynomial Z-space A few potential issues Linear model fourth order polynomial Transforming the data seems like a good idea: u But: u u Better chance of obtaining linear separability At the

the complexity of the model: danger of overfitting High order polynomial transform leads to nonsense.

3 The polynomial Z-space A few potential issues Linear model fourth order polynomial Transforming the data seems like a good idea: u But: u u Better chance of obtaining linear separability At the expense of potential overfitting Computationally expensive (memory and time) Kernels: avoid the computational expense by an implicit mapping Feature-space dimensionality increases rapidly and with it the complexity of the model: danger of overfitting High order polynomial transform leads to nonsense. c AM L Creator: Malik Magdon-Ismail Nonlinear Transforms: 5/7 Digits data 9 0 Achieving non-linear discriminant functions How to avoid the overhead of explicit mapping Consider two dimensional data and the mapping (x) =(x, p x x,x ) Let s plug that into the discriminant function: w (x) =w x + p w x x + w 3 x The resulting decision boundary is a conic section. Suppose the weight vector can be expressed as: w = i x i i= The discriminant function is then: w f(x) = X x + b i x i x + b i And using our nonlinear mapping: w (x)+b f(x) = X i i (x i ) (x)+b Turns out we can often compute the dot product without explicitly mapping the data into a high dimensional feature space! parabola circle/ellipse hyperbola 3

4 Let s go back to the example and compute the dot product Example (x) =(x, p x x,x ) (x) (z) =(x, p x x,x ) (z, p z z,z) = x z +x x z z + x z =(x z) Do we need to perform the mapping explicitly? Let s go back to the example and compute the dot product Example (x) =(x, p x x,x ) (x) (z) =(x, p x x,x ) (z, p z z,z) = x z +x x z z + x z =(x z) Do we need to perform the mapping explicitly? NO! Squaring the dot product in the original space has the same effect as computing the dot product in feature space. 3 4 Kernels Definition: A function k(x, z) that can be expressed as a dot product in some feature space is called a kernel. In other words, k(x, z) is a kernel if there exists such that Why is this interesting? k(x, z) = (x) (z) : X 7! F If the algorithm can be expressed in terms of dot products, we can work in the feature space without performing the mapping explicitly! The dual SVM problem The dual SVM formulation depends on the data through dot products, and so can be expressed using kernels. Replace with: maximize i= i subject to: 0 apple i apple C, maximize i= i subject to: 0 apple i apple C, i= j= i j y i y j x i x j i y i =0 i= i= j= i j y i y j k(x i, x j ) i y i =0 i= 5 6 4

Standard kernel functions Standard kernel functions The linear kernel Homogeneous polynomial kernel Polynomial kernel Gaussian kernel k(x, z) =x z k(x, z) =(x z) d k(x, z) =(x z + ) d k(x, z) =exp( x

5 Standard kernel functions Standard kernel functions The linear kernel Homogeneous polynomial kernel Polynomial kernel Gaussian kernel k(x, z) =x z k(x, z) =(x z) d k(x, z) =(x z + ) d k(x, z) =exp( x z ) The linear kernel Homogeneous polynomial kernel Polynomial kernel Gaussian kernel k(x, z) =x z k(x, z) =(x z) d k(x, z) =(x z + ) d k(x, z) =exp( x z ) Feature space: Original features All monomials of degree d All monomials of degree less or equal to d Infinite dimensional 7 8 Demo Demo Using polynomial kernel: Using the Gaussian kernel: k(x, z) =exp( x z ) k(x, z) =(x z + ) d 9 0 5

Kernelizing the perceptron algorithm Kernelizing the perceptron algorithm Recall the primal version of the perceptron algorithm: Input: labeled data D in homogeneous coordinates Output: weight vector

coordinates Output: weight vector α α = 0 converged = false while not converged : w = converged = true for i in,, D : if x i is misclassified update α and set converged=false i x i i= What do you

6 Kernelizing the perceptron algorithm Kernelizing the perceptron algorithm Recall the primal version of the perceptron algorithm: Input: labeled data D in homogeneous coordinates Output: weight vector w w = 0 converged = false while not converged : converged = true for i in,, D : if x i is misclassified update w and set converged=false w 0 = w + y i x i Input: labeled data D in homogeneous coordinates Output: weight vector α α = 0 converged = false while not converged : w = converged = true for i in,, D : if x i is misclassified update α and set converged=false i x i i= What do you need to change to express it in the dual? (I.e. express the algorithm in terms of the alpha coefficients) Kernelizing the perceptron algorithm Linear regression revisited Input: labeled data D in homogeneous coordinates Output: weight vector α α = 0 converged = false while not converged : converged = true for i in,, D : if x i is misclassified update α and set converged=false The update Is equivalent to: i! i + y i w 0 = w + y i x i 3 The sum-squared cost function: X (y i w x i ) =(y Xw) (y Xw) i The optimal solution satisfies: If we express w as: We get: X Xw = X y w = i x i = X i= X XX = X y 4 6

7 Kernel linear regression We now get that α satisfies: Compare with: XX = y X Xw = X y Which is harder to find? What have we gained? X X XX The kernel matrix The covariance matrix (d x d) Matrix of dot products associated with a dataset (n x n). Can replace it with a matrix K such that: K ij = (x i ) (x j )=k(x i, x j ) This is the kernel matrix associated with a dataset a.k.a the Gram matrix 5 6 How does that matrix look like? Kernel matrix for gene expression data in yeast: Properties of the kernel matrix The kernel matrix: K ij = (x i ) (x j )=k(x i, x j ) q q q q Symmetric (and therefore has real eigenvalues) Diagonal elements are positive Every kernel matrix is positive semi-definite, i.e. x Kx 0 8x Corollary: the eigenvalues of a kernel matrix are positive 7 8 7

8 Standard kernel functions Some tricks for constructing kernels The linear kernel Homogeneous polynomial kernel Polynomial kernel k(x, z) =x z k(x, z) =(x z) d k(x, z) =(x z + ) d Gaussian kernel (aka RBF kernel) k(x, z) =exp( x z ) Feature space: Original features All monomials of degree d All monomials of degree less or equal to d Infinite dimensional Let K(x,z) be a kernel function If a > 0 then a. K(x,z) is a kernel How do we even know that the Gaussian is a valid kernel? 9 30 Some tricks for constructing kernels Some tricks for constructing kernels Sums of kernels are kernels: Sums of kernels are kernels: Let K and K be kernel functions then K + K is a kernel Let K and K be kernel functions then K + K is a kernel What is the feature map that shows this? Feature map: concatenation of the underlying feature maps 3 3 8

9 Some tricks for constructing kernels Some tricks for constructing kernels Products of kernels are kernels: Products of kernels are kernels: Let K and K be kernel functions then K K is a kernel Let K and K be kernel functions then K K is a kernel Construct a feature map that contains all products of pairs of features The cosine kernel The cosine kernel If K(x,z) is a kernel then K 0 (x, z) = K(x, z) p K(x, x)k(z,z) If K(x,z) is a kernel then K 0 (x, z) = K(x, z) p K(x, x)k(z,z) is a kernel is a kernel This kernel is equivalent to normalizing each example to have unit norm in the feature space associated with the kernel. This kernel is the cosine in the feature space associated with the kernel K: (x) (z) cos( (x), (z)) = (x) (z) = (x) (z) p (x) (x) (z) (z)

10 Infinite sums of kernels Theorem: A function K(x, z) =K(x z) with a series expansion X K(t) = a n t n n=0 is a kernel iff a n 0 for all n. Infinite sums of kernels Theorem: A function K(x, z) =K(x z) with a series expansion X K(t) = a n t n n=0 is a kernel iff a n 0 for all n. Corollary: K(x, z) =exp( x z) is a kernel Infinite sums of kernels Theorem: A function K(x, z) =K(x z) with a series expansion X K(t) = a n t n n=0 is a kernel iff a n 0 for all n. Corollary: K(x, z) =exp( x z) is a kernel Corollary: k(x, z) =exp( x z ) is a kernel exp( x z )=exp( (x x + z z)+ (x z)) i.e. the Gaussian kernel is the cosine kernel of the exponential kernel K 0 K(x, z) (x, z) = p K(x, x)k(z,z) 39 0

Learning From Data Lecture 10 Nonlinear Transforms

Learning From Data Lecture 10 Nonlinear Transforms Learning From Data Lecture 0 Nonlinear Transforms The Z-space Polynomial transforms Be careful M. Magdon-Ismail CSCI 400/600 recap: The Linear Model linear in w: makes the algorithms work linear in x: