Kernel Methods

Charles Elkan
elkan@cs.ucsd.edu

October 17, 2007

Remember the xor example of a classification problem that is not linearly separable. If we map every example into a new representation, then the problem becomes linearly separable. Specifically, ...

The major disadvantage of mapping points into a new space is that the new space may have very high dimension. For example, if points lie in d-dimensional Euclidean space and we include the product of every pair of dimensions, then we have quadratic blowup with the mapping $f : \mathbb{R}^d \to \mathbb{R}^{d^2}$. We can avoid this explosion if we can achieve two objectives:

1. Rewrite our learning algorithm so that instead of using $f(x)$ and $f(y)$ directly, it only uses the dot-product $K(x, y) = f(x) \cdot f(y)$.

2. Compute dot-products $K(x, y)$ in some indirect way, that is, without computing $f(x)$ and $f(y)$ explicitly.

Together, these two objectives are called the kernel trick. Achieving the first one is called kernelizing the learning algorithm. This means rewriting the algorithm so that it uses training examples solely by storing them and by computing dot-products involving them.

When we map examples into a higher-dimensional space, linear classification in the new space is typically equivalent to nonlinear classification in the original space. Thus the kernel trick lets us extend a linear classification method such as the perceptron to a nonlinear method.

The kernel trick was first published in 1964 by Aizerman et al., but it was not explored widely until the 1990s. It was first investigated extensively in the context of support vector machines, but more recently it has been applied to many other learning methods.

For a simple example, consider kernelizing the perceptron. Remember the basic algorithm:

    w := 0
    repeat for T epochs:
        for i = 1 to i = m:
            if y_i ≠ sign(w · x_i) then w := w + y_i x_i
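In Python, a minimal sketch of this basic algorithm might look as follows; the function names are ours, and we assume the examples are the rows of a numpy array X with labels y_i in {-1, +1}:

    import numpy as np

    def perceptron_train(X, y, T):
        """Basic perceptron. X is an m-by-d array of examples, y a length-m
        array of labels in {-1, +1}, and T is the number of epochs."""
        m, d = X.shape
        w = np.zeros(d)
        for _ in range(T):
            for i in range(m):
                # Update w whenever example i is misclassified.
                if y[i] != np.sign(w @ X[i]):
                    w = w + y[i] * X[i]
        return w

    def perceptron_predict(w, x):
        return np.sign(w @ x)

Note that the learned classifier here is the weight vector $w$ itself; the kernelized version discussed next never forms $w$ explicitly.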
At each point in the execution of the algorithm, the vector $w$ is a sum of data points, $w = \sum_{j \in J} y_j x_j$ for some subset $J$ of $\{1, 2, \ldots, m\}$. So $w \cdot x = \sum_{j \in J} y_j (x_j \cdot x)$. Now suppose we have a kernel function $K$. The prediction for an example $x$ is just $\mathrm{sign}(\sum_{j \in J} y_j K(x_j, x))$.

Given $m$ training points, how many kernel calculations do we need to do in order to make a prediction? Because $J$ is a subset of $\{1, 2, \ldots, m\}$, we may need to do up to $m$ kernel calculations. This is true regardless of how many epochs of training we do, since there are still only $m$ different possible values $K(x_j, x)$.

Every training example $x_j$ that is actually used in the final classifier is called a support vector. The kernelized perceptron is thus a type of support vector machine, but conventionally this name is reserved for a different learning approach.

Intuitively, the perceptron algorithm is a good one to kernelize because the proof of its convergence is independent of the dimensionality of the data. This means that the dimensionality of $f(x)$ is irrelevant as a determinant of the generalization ability of the learned classifier. Instead, what is relevant is the ratio $R/\delta$ of the radius of the data to the margin of separation, as measured in the new space.

Now that we have seen how to kernelize at least one learning algorithm, we can pay attention to the second part of the kernel trick, which is to compute interesting kernel functions efficiently.

Suppose data points $x$ live in d-dimensional Euclidean space, and suppose we want to re-represent them with every quadratic combination of features. If we can do this, we can learn separating surfaces that are quadratic in the original space $\mathbb{R}^d$. (Quadratic surfaces include circles, ellipses, parabolas, and their analogs in higher dimensions.)

Suppose we wish to re-represent a point $x$ in a quadratic way, that is, as the vector of products $f(x) = \langle \ldots, x_i x_j, \ldots \rangle$ for all $i$ and $j$, including $i = j$. This vector has length $d^2$. The dot-product after re-representation is the sum of all $d^2$ terms $x_i x_j y_i y_j$. How can we compute this efficiently, i.e. indirectly? Consider
$$(x \cdot y)^2 = (x_1 y_1 + x_2 y_2 + \cdots + x_d y_d)^2.$$
This is a sum of all $d^2$ possible terms of the form $x_i y_i x_j y_j$. By commutativity $x_i y_i x_j y_j = x_i x_j y_i y_j$, and it follows that $f(x) \cdot f(y) = (x \cdot y)^2$.
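To make this concrete, here is one way to kernelize the perceptron in Python (a sketch, with the same assumptions as before). Instead of maintaining $w$, we keep a count $\alpha_j$ of how many times example $j$ has triggered an update; the set $J$ above is the set of examples with a nonzero count, and prediction uses only kernel evaluations. The quadratic kernel $K(x, y) = (x \cdot y)^2$ serves as the illustration, since it equals $f(x) \cdot f(y)$ for the quadratic feature map just described:

    import numpy as np

    def quadratic_kernel(x, y):
        # Equals f(x) . f(y) where f(x) lists all d^2 products x_i x_j.
        return (x @ y) ** 2

    def kernel_perceptron_train(X, y, K, T):
        """Kernelized perceptron. alpha[j] counts how many times example j
        has triggered an update; J = {j : alpha[j] > 0}."""
        m = X.shape[0]
        alpha = np.zeros(m)
        for _ in range(T):
            for i in range(m):
                # The prediction uses only kernel values, never f(x) explicitly.
                score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(m))
                if y[i] != np.sign(score):
                    alpha[i] += 1
        return alpha

    def kernel_perceptron_predict(alpha, X, y, K, x):
        # At most m kernel calculations per prediction; only the support
        # vectors (those with alpha[j] > 0) actually contribute.
        score = sum(alpha[j] * y[j] * K(X[j], x) for j in range(len(alpha)))
        return np.sign(score)

    # Sanity check of the identity f(u) . f(v) = (u . v)^2 with explicit features:
    u, v = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
    fu, fv = np.outer(u, u).ravel(), np.outer(v, v).ravel()
    assert np.isclose(fu @ fv, quadratic_kernel(u, v))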
Now consider $(1 + x \cdot y)^2 = 1 + 2\, x \cdot y + f(x) \cdot f(y)$. This equals $g(x) \cdot g(y)$ where
$$g(x) = \langle 1, \ldots, \sqrt{2}\, x_i, \ldots, x_i x_j, \ldots \rangle,$$
which is a re-representation of $x$ that includes all products of degree 0, 1, and 2 of elements of $x$. By definition, then, every quadratic function of $x$ can be written as $w \cdot g(x)$ for some weight vector $w$; the weight vector $w$ can adjust for the fact that some components of $g(x)$ are scaled by $\sqrt{2}$.

In general, the function $K(x, y) = (1 + x \cdot y)^n$ is called the polynomial kernel of degree $n$. The binomial theorem says that
$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k.$$
Let $a = 1$ and $b = x \cdot y$. We obtain
$$K(x, y) = (1 + x \cdot y)^n = \sum_{k=0}^{n} \binom{n}{k} (x \cdot y)^k.$$
It is not hard to show that $K(x, y) = h(x) \cdot h(y)$ where $h(x)$ is a vector that includes all products of degree $n$ and lower of components of $x$, with scaling factors on the components.

Kernels can also be defined for data points that are not real-valued vectors. As long as $f(x)$ is a vector of the same length regardless of $x$, $x$ itself need not be real-valued and need not be of fixed size; for example, $x$ may be a sequence of arbitrary length.

Let $A$ be an alphabet, and consider some total ordering of all strings over this alphabet, for example a lexicographic ordering over pairs $\langle m, s \rangle$ where $m$ is the length of the string $s$. Let $s_n$ be string number $n$ in this ordering. Now let $x$ be any string and define $f_n(x)$ to be the number of times $s_n$ appears as a substring of $x$, including overlapping appearances. Let $f(x)$ be the representation $\langle f_0(x), f_1(x), \ldots \rangle$. Note that the vector $f(x)$ is always of infinite length, but for any string $x$ of length $m$ there exists a threshold $T \le \sum_{i=0}^{m} |A|^i$ such that all components $f_t(x)$ are zero for $t > T$. Hence for any two strings the dot-product $f(x) \cdot f(y)$ is a finite integer.

Given two strings, how do we compute $K(x, y) = f(x) \cdot f(y)$? For each substring of $x$, count how often it appears in $y$, and sum up these counts. We can do this in $O(|x|\,|y|)$ time using dynamic programming, where $|y|$ denotes the length of the string $y$.
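The dynamic program is left implicit above; one natural choice (an assumption here, not necessarily the exact recurrence intended) is to compute, for every pair of starting positions, the length of the longest common substring beginning there. Each such length counts exactly the matching pairs of substring occurrences that start at those two positions, so summing the table gives $f(x) \cdot f(y)$:

    def substring_kernel(x, y):
        """K(x, y) = f(x) . f(y), where f_n(x) counts occurrences of string
        s_n as a substring of x. Runs in O(|x| |y|) time and space."""
        nx, ny = len(x), len(y)
        # lce[i][j] = length of the longest common prefix of x[i:] and y[j:].
        lce = [[0] * (ny + 1) for _ in range(nx + 1)]
        for i in range(nx - 1, -1, -1):
            for j in range(ny - 1, -1, -1):
                if x[i] == y[j]:
                    lce[i][j] = 1 + lce[i + 1][j + 1]
        # Positions (i, j) contribute one matching pair of occurrences for
        # each substring length 1, 2, ..., lce[i][j].
        return sum(lce[i][j] for i in range(nx) for j in range(ny))

For example, substring_kernel("ab", "ab") returns 3, one for each of the common substrings "a", "b", and "ab".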
Intuitively, a kernel is a similarity function. However, we cannot use just any heuristic similarity function as a genuine kernel. To be a genuine kernel, a function must be a dot-product in some space. Mathematically, Mercer's theorem says when a function $K$ has this property. First we need two definitions.

Definition: The function $K : X \times X \to \mathbb{R}$ is symmetric if and only if $K(x, y) = K(y, x)$ for all $x$ and $y$ in $X$.

Definition: The function $K : X \times X \to \mathbb{R}$ is non-negative definite if and only if
$$\sum_{i,j} c_i c_j K(x_i, x_j) \ge 0$$
for every finite subset $\{x_1, \ldots, x_n\}$ of $X$ and every choice of real numbers $c_1, \ldots, c_n$.

The sum above is equal to $c^T K c$, where $c$ is the column vector $\langle c_1, \ldots, c_n \rangle^T$ and $K$ is the $n \times n$ matrix whose $ij$th entry is $K(x_i, x_j)$. A symmetric non-negative definite matrix has non-negative eigenvalues. Now we can state Mercer's theorem.

Theorem [James Mercer, 1909]: There exist a $d$ and a function $f : X \to \mathbb{R}^d$ such that $K(x, y) = f(x) \cdot f(y)$ if and only if $K(x, y)$ is symmetric and non-negative definite.

One of the most widely used kernels is based on Euclidean distance. The Gaussian kernel is defined to be
$$K(x, y) = \exp(-\|x - y\|^2 / s^2)$$
where $s$ is a parameter. This is also called the radial basis function kernel with parameter $s$. It can be proved that this function $K$ is positive semi-definite, so we can use it as a kernel without knowing explicitly what space it corresponds to.

If we learn a linear classifier using a Gaussian kernel, the result is similar to a nearest-neighbor classifier. Given a test example $z$, we compute its distance to every support vector selected by the training algorithm. For support vectors that are not near $z$, the kernel value is close to zero, so their labels contribute almost nothing to the prediction for $z$. With a Gaussian kernel, the support vectors are the centers of the radial basis functions. A remarkable feature of the kernelized perceptron with a Gaussian kernel is that the training algorithm automatically chooses how many centers to use. However, which centers are chosen, and how many, depend on the width $s^2$ of the kernel, which is not chosen automatically.

Kernels possess useful closure properties. Given two kernels, we can make new ones by adding them, by multiplying them with each other, and by adding or multiplying by positive scalar constants. We can also raise a kernel to a positive integer power, and hence exponentiate a kernel as well, because the exponential function is the limit of a series of polynomials. We may also normalize a kernel, i.e. use
$$N(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\,K(y, y)}}.$$
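As a small numerical illustration (a sketch with assumed details such as the sample size and the helper name gaussian_kernel), we can check Mercer's conditions for the Gaussian kernel on a finite sample by verifying that the kernel matrix is symmetric with non-negative eigenvalues, and we can normalize the matrix as above:

    import numpy as np

    def gaussian_kernel(x, y, s=1.0):
        # K(x, y) = exp(-||x - y||^2 / s^2), the radial basis function kernel.
        return np.exp(-np.sum((x - y) ** 2) / s ** 2)

    # Kernel matrix on a finite sample {x_1, ..., x_n}.
    rng = np.random.default_rng(0)
    points = rng.normal(size=(20, 3))
    K = np.array([[gaussian_kernel(a, b) for b in points] for a in points])

    # Mercer's conditions on this sample: symmetry, and eigenvalues >= 0
    # (up to floating-point round-off).
    assert np.allclose(K, K.T)
    print(np.linalg.eigvalsh(K).min())  # should not be noticeably below zero

    # Normalized kernel N(x, y) = K(x, y) / sqrt(K(x, x) K(y, y)).
    # For the Gaussian kernel K(x, x) = 1, so here N coincides with K.
    diag = np.sqrt(np.diag(K))
    N = K / np.outer(diag, diag)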
The closure properties for kernels let us use, for example, the general polynomial kernel $K(x, y) = (x \cdot y + R)^d$ for any constant $R \ge 0$.

The great flexibility in defining kernels invites the question of which kernels are better. This question has no definitive answer. At one extreme, a kernel that gives a diagonal kernel matrix is useless: every training example then appears similar only to itself, so the matrix says nothing about how examples relate to one another. At the other extreme, a perfect kernel divides the data into disjoint subspaces, each of which has the same label.
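Finally, to connect the general polynomial kernel back to the xor example at the start of these notes, here is a short sketch (with assumed details: the label assignment, $R = 1$, degree 2, and the epoch budget; it reuses the kernelized perceptron idea from earlier) showing that a degree-2 polynomial kernel separates the xor data, even though no linear classifier in the original two dimensions can:

    import numpy as np

    def poly_kernel(x, y, R=1.0, d=2):
        # The general polynomial kernel K(x, y) = (x . y + R)^d.
        return (x @ y + R) ** d

    # The xor problem: not linearly separable in the original two dimensions.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([-1, 1, 1, -1])

    # Kernelized perceptron: alpha[j] counts how many updates x_j has triggered.
    alpha = np.zeros(len(X))
    for epoch in range(1000):
        mistakes = 0
        for i in range(len(X)):
            score = sum(alpha[j] * y[j] * poly_kernel(X[j], X[i]) for j in range(len(X)))
            if y[i] != np.sign(score):
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break

    # All four xor points end up classified correctly in the quadratic feature space.
    preds = [np.sign(sum(alpha[j] * y[j] * poly_kernel(X[j], z) for j in range(len(X))))
             for z in X]
    print(preds)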