CIS 520: Machine Learning                                                         Oct 09, 2017

                                      Kernel Methods

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
• Non-linear models via basis functions
• Closer look at the SVM dual: kernel functions, kernel SVM
• RKHSs and Representer Theorem
• Kernel logistic regression
• Kernel ridge regression

1  Non-linear Models via Basis Functions

Let X = R^d. We have seen methods for learning linear models of the form h(x) = sign(w^⊤x + b) for binary classification (such as logistic regression and SVMs) and f(x) = w^⊤x + b for regression (such as linear least squares regression and SVR). What if we want to learn a non-linear model? What would be a simple way to achieve this using the methods we have seen so far?

One way to achieve this is to map instances x ∈ R^d to some new feature vectors φ(x) ∈ R^n via some non-linear feature mapping φ : R^d → R^n, and then to learn a linear model in this transformed space. For example, if one maps instances x ∈ R^d to n = (1 + 2d + (d choose 2))-dimensional feature vectors

    φ(x) = (1, x_1, …, x_d, x_1^2, …, x_d^2, x_1x_2, x_1x_3, …, x_{d−1}x_d)^⊤,

then learning a linear model in the transformed space is equivalent to learning a quadratic model in the original instance space.
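For concreteness, here is a minimal sketch of this idea, assuming Python/NumPy; the helper name quadratic_features, the synthetic data, and the choice of plain least squares on the transformed features are illustrative choices, not part of the notes.

# Learn a quadratic model in R^d by ordinary least squares on an explicit
# degree-2 feature expansion phi(x).
import numpy as np

def quadratic_features(X):
    """Map each row x in R^d to (1, x_1,...,x_d, x_1^2,...,x_d^2, x_i x_j for i<j)."""
    m, d = X.shape
    cross = [X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d)]
    cols = [np.ones(m)] + [X[:, i] for i in range(d)] + [X[:, i]**2 for i in range(d)] + cross
    return np.stack(cols, axis=1)        # shape (m, 1 + 2d + d(d-1)/2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X[:, 0] - 2.0 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=100)  # quadratic target

Phi = quadratic_features(X)                     # design matrix with Phi_ij = phi_j(x_i)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # linear least squares in the transformed space
print(Phi.shape, np.round(w[:4], 2))

Fitting w on Φ is an ordinary linear least squares problem; all of the non-linearity lives in the feature map.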
In general, one can choose any basis functions φ_1, …, φ_n : X → R, and learn a linear model over these: w^⊤φ(x) + b, where w ∈ R^n (in fact, one can do this for X ≠ R^d as well). For example, in least squares regression applied to a training sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (R^d × R)^m, one would simply replace the matrix X ∈ R^{m×d} with the design matrix Φ ∈ R^{m×n}, where Φ_{ij} = φ_j(x_i).

What is a potential difficulty in doing this? If n is large (e.g. as would be the case if the feature mapping φ corresponded to a high-degree polynomial), then the above approach can be computationally expensive. In this lecture we look at a technique that allows one to implement the above idea efficiently for many algorithms. We start by taking a closer look at the SVM dual which we derived in the last lecture.

2  Closer Look at the SVM Dual: Kernel Functions, Kernel SVM

Recall the form of the dual we derived for the (soft-margin) linear SVM:

    max_α    −(1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i^⊤x_j) + Σ_{i=1}^m α_i        (1)
    subject to    Σ_{i=1}^m α_i y_i = 0                                                   (2)
                  0 ≤ α_i ≤ C,  i = 1, …, m.                                              (3)

If we implement this on feature vectors φ(x_i) ∈ R^n in place of x_i ∈ R^d, we get the following optimization problem:

    max_α    −(1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (φ(x_i)^⊤φ(x_j)) + Σ_{i=1}^m α_i  (4)
    subject to    Σ_{i=1}^m α_i y_i = 0                                                   (5)
                  0 ≤ α_i ≤ C,  i = 1, …, m.                                              (6)

This involves computing dot products between vectors φ(x_i), φ(x_j) in R^n. Similarly, using the learned model to make predictions on a new test point x ∈ R^d also involves computing dot products between vectors in R^n:

    h(x) = sign( Σ_{i∈SV} α_i y_i φ(x_i)^⊤φ(x) + b ).

For example, as we saw above, one can learn a quadratic classifier in X = R^2 by learning a linear classifier in φ(R^2) ⊆ R^6, where

    φ((x_1, x_2)) = (1, x_1, x_2, x_1^2, x_2^2, x_1x_2)^⊤;

clearly, a straightforward approach to learning an SVM classifier in this space (and applying it to a new test point) will involve computing dot products in R^6 (more generally, when learning a degree-q polynomial in R^d, such a straightforward approach will involve computing dot products in R^n for n = O(d^q)).
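To get a rough sense of how quickly the explicit approach blows up, the following back-of-the-envelope sketch (Python; the particular values of d and q are arbitrary) counts the monomials of degree at most q in d variables, of which there are C(d+q, q) = O(d^q) for fixed q.

# Dimension of the explicit degree-q polynomial feature space, up to constants.
from math import comb

for d in (2, 10, 100, 1000):
    for q in (2, 3, 5):
        print(f"d={d:5d}  q={q}  n = C(d+q, q) = {comb(d + q, q):,}")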
Now, consider replacing the dot products φ(x)^⊤φ(x′) in the above example with K(x, x′), where for x, x′ ∈ R^2,

    K(x, x′) = (x^⊤x′ + 1)^2.

It can be verified (exercise!) that K(x, x′) = φ_K(x)^⊤φ_K(x′), where

    φ_K((x_1, x_2)) = (1, x_1^2, x_2^2, √2 x_1, √2 x_2, √2 x_1x_2)^⊤.

Thus, using K(x, x′) above instead of φ(x)^⊤φ(x′) implicitly computes dot products in R^6, with computation of dot products required only in R^2!

In fact, one can use any symmetric, positive semi-definite kernel function K : X × X → R (also called a Mercer kernel function) in the SVM algorithm directly, even if the feature space implemented by the kernel function cannot be described explicitly. Any such kernel function yields a convex dual problem; if K is positive definite, then K also corresponds to inner products in some inner product space V (i.e. K(x, x′) = ⟨φ(x), φ(x′)⟩ for some φ : X → V).

For Euclidean instance spaces X = R^d, examples of commonly used kernel functions include the polynomial kernel K(x, x′) = (x^⊤x′ + 1)^q, which results in learning a degree-q polynomial threshold classifier, and the Gaussian kernel, also known as the radial basis function (RBF) kernel, K(x, x′) = exp(−‖x − x′‖_2^2 / (2σ^2)) (where σ > 0 is a parameter of the kernel), which effectively implements dot products in an infinite-dimensional inner product space; in both cases, evaluating the kernel K(x, x′) at any two points x, x′ requires only O(d) computation time. Kernel functions can also be used for non-vectorial data (X ≠ R^d); for example, kernel functions are often used to implicitly embed instance spaces containing strings, trees, etc. into an inner product space, and to implicitly learn a linear classifier in this space. Intuitively, it is helpful to think of kernel functions as capturing some sort of similarity between pairs of instances in X.

To summarize, given a training sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × {±1})^m, in order to learn a kernel SVM classifier using a kernel function K : X × X → R, one simply solves the kernel SVM dual given by

    max_α    −(1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j K(x_i, x_j) + Σ_{i=1}^m α_i        (7)
    subject to    Σ_{i=1}^m α_i y_i = 0                                                    (8)
                  0 ≤ α_i ≤ C,  i = 1, …, m,                                               (9)

and then predicts the label of a new instance x ∈ X according to

    h(x) = sign( Σ_{i∈SV} α_i y_i K(x_i, x) + b ),   where   b = (1/|SV|) Σ_{i∈SV} ( y_i − Σ_{j∈SV} α_j y_j K(x_i, x_j) ).
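The sketch below first checks numerically that (x^⊤x′ + 1)^2 = φ_K(x)^⊤φ_K(x′) on R^2, and then trains a kernel SVM by handing a precomputed Gram matrix K_ij = K(x_i, x_j) to scikit-learn's SVC, which solves a soft-margin dual of the form (7)–(9). Assuming NumPy and scikit-learn are available; the data and the helper name poly_kernel are illustrative.

import numpy as np
from sklearn.svm import SVC

def phi_K(x):
    """Explicit feature map for K(x, x') = (x.x' + 1)^2 on R^2."""
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
assert np.isclose((x @ xp + 1)**2, phi_K(x) @ phi_K(xp))    # kernel trick check in R^2

# Kernel SVM with a precomputed Gram matrix K_ij = K(x_i, x_j).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 1.0)                  # non-linearly separable labels
poly_kernel = lambda A, B: (A @ B.T + 1.0)**2
K = poly_kernel(X, X)
clf = SVC(C=1.0, kernel='precomputed').fit(K, y)

X_test = rng.normal(size=(5, 2))
K_test = poly_kernel(X_test, X)                              # kernel between test and training points
print(clf.predict(K_test))

Note that at no point does the learner touch the six-dimensional feature vectors; all computation goes through kernel evaluations in R^2.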
3  RKHSs and Representer Theorem

Let K : X × X → R be a symmetric positive definite kernel function. Let

    F_K^0 = { f : X → R | f(x) = Σ_{i=1}^r α_i K(x_i, x) for some r ∈ Z_+, α_i ∈ R, x_i ∈ X }.

For f, g ∈ F_K^0 with f(x) = Σ_{i=1}^r α_i K(x_i, x) and g(x) = Σ_{j=1}^s β_j K(x′_j, x), define

    ⟨f, g⟩_K = Σ_{i=1}^r Σ_{j=1}^s α_i β_j K(x_i, x′_j)                                    (10)
    ‖f‖_K = √⟨f, f⟩_K.                                                                     (11)

Let F_K be the completion of F_K^0 under the metric induced by the above norm.¹ Then F_K is called the reproducing kernel Hilbert space (RKHS) associated with K.²

Note that the SVM classifier learned using kernel K is of the form

    h(x) = sign(f(x) + b),

where f(x) = Σ_{i∈SV} α_i y_i K(x_i, x), i.e. where f ∈ F_K. In fact, consider the following optimization problem:

    min_{f∈F_K, b∈R}   (1/m) Σ_{i=1}^m (1 − y_i(f(x_i) + b))_+  +  λ‖f‖_K^2.

It turns out that the above SVM solution (with C = 1/(2λm)) is a solution to this problem, i.e. the kernel SVM solution minimizes the RKHS-norm regularized hinge loss over all functions of the form f(x) + b for f ∈ F_K, b ∈ R.

More generally, we have the following result:

Theorem (Representer Theorem). Let K : X × X → R be a positive definite kernel function. Let Y ⊆ R. Let S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × Y)^m. Let L : R^m × Y^m → R. Let Ω : R_+ → R_+ be a monotonically increasing function. Then for λ > 0, there is a solution to the optimization problem

    min_{f∈F_K, b∈R}   L( (f(x_1) + b, …, f(x_m) + b), (y_1, …, y_m) )  +  λ Ω(‖f‖_K^2)

of the form

    f(x) = Σ_{i=1}^m α_i K(x_i, x)   for some α_1, …, α_m ∈ R.

If Ω is strictly increasing, then all solutions have this form.

The above result tells us that even if F_K is an infinite-dimensional space, any optimization problem resulting from minimizing a loss over a finite training sample regularized by some increasing function of the RKHS-norm is effectively a finite-dimensional optimization problem, and moreover, the solution to this problem can be written as a kernel expansion over the training points. In particular, minimizing any other loss over F_K (regularized by the RKHS-norm) will also yield a solution of this form!

Exercise. Show that linear functions f : R^d → R of the form f(x) = w^⊤x form an RKHS with linear kernel K : R^d × R^d → R given by K(x, x′) = x^⊤x′ and with ‖f‖_K^2 = ‖w‖_2^2.

¹ The metric induced by the norm ‖·‖_K is given by d_K(f, g) = ‖f − g‖_K. The completion of F_K^0 is simply F_K^0 plus any limit points of Cauchy sequences in F_K^0 under this metric.
² The name reproducing kernel Hilbert space comes from the following reproducing property: for any x ∈ X, define K_x : X → R as K_x(x′) = K(x, x′); then for any f ∈ F_K, we have ⟨f, K_x⟩_K = f(x).
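As a small numerical illustration of these definitions (a sketch assuming NumPy; the Gaussian kernel and the points below are arbitrary choices), one can take a finite expansion f = Σ_i α_i K(x_i, ·) and check both the reproducing property ⟨f, K_x⟩_K = f(x) and the identity ‖f‖_K^2 = α^⊤Kα that follows from (10) and (11).

import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)        # pairwise squared distances
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
Xpts = rng.normal(size=(5, 3))                                # the points x_1,...,x_r defining f
alpha = rng.normal(size=5)

def f(x):
    return alpha @ gauss_kernel(Xpts, x[None, :])[:, 0]       # f(x) = sum_i alpha_i K(x_i, x)

x_new = rng.normal(size=3)
# Reproducing property: <f, K_{x_new}>_K computed via (10) equals f(x_new).
inner = alpha @ gauss_kernel(Xpts, x_new[None, :])[:, 0]      # sum_i alpha_i * 1 * K(x_i, x_new)
assert np.isclose(inner, f(x_new))

# RKHS norm of f via (10)-(11): ||f||_K^2 = alpha^T K alpha.
K = gauss_kernel(Xpts, Xpts)
print("||f||_K^2 =", alpha @ K @ alpha)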
4  Kernel Logistic Regression

Given a training sample S ∈ (X × {±1})^m and kernel function K : X × X → R, the kernel logistic regression classifier is given by the solution to the following optimization problem:

    min_{f∈F_K, b∈R}   (1/m) Σ_{i=1}^m ln(1 + e^{−y_i(f(x_i)+b)})  +  λ‖f‖_K^2.

Since we know from the Representer Theorem that the solution has the form f(x) = Σ_{i=1}^m α_i K(x_i, x), we can write the above as an optimization problem over α, b (using the fact, from (10), that ‖f‖_K^2 = Σ_{i=1}^m Σ_{j=1}^m α_i α_j K(x_i, x_j) for such f):

    min_{α∈R^m, b∈R}   (1/m) Σ_{i=1}^m ln(1 + e^{−y_i(Σ_{j=1}^m α_j K(x_j, x_i) + b)})  +  λ Σ_{i=1}^m Σ_{j=1}^m α_i α_j K(x_i, x_j).

This is of a similar form to standard logistic regression, with m basis functions φ_j(x) = K(x_j, x) for j ∈ [m] (and with α playing the role of w)! In particular, define K ∈ R^{m×m} as K_{ij} = K(x_i, x_j) (this is often called the Gram matrix), and let k_i denote the i-th column of this matrix. Then we can write the above simply as

    min_{α∈R^m, b∈R}   (1/m) Σ_{i=1}^m ln(1 + e^{−y_i(α^⊤k_i + b)})  +  λ α^⊤Kα,

which is similar to the form for standard linear logistic regression (with feature vectors k_i), except that the regularizer is α^⊤Kα rather than ‖α‖_2^2, and can be solved as before using similar numerical optimization methods (a small sketch appears below). We note that unlike SVMs, here in general the solution has α_i ≠ 0 for all i ∈ [m]. A variant of logistic regression called the import vector machine (IVM) adopts a greedy approach to find a subset IV ⊆ [m] such that the function

    f(x) + b = Σ_{i∈IV} α_i K(x_i, x) + b

gives good performance. Compared to SVMs, IVMs can provide more natural class probability estimates, as well as more natural extensions to multiclass classification.
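A minimal sketch of one such numerical approach, assuming NumPy and using plain gradient descent on the objective in α and b; the RBF kernel, step size, and iteration count are arbitrary illustrative choices.

# Kernel logistic regression:
#   minimize (1/m) sum_i ln(1 + exp(-y_i (alpha^T k_i + b))) + lambda * alpha^T K alpha
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] * X[:, 1])                 # labels in {-1, +1}
m, lam, eta = len(y), 0.01, 0.05

K = rbf_kernel(X, X)                           # Gram matrix, K_ij = K(x_i, x_j)
alpha, b = np.zeros(m), 0.0
for _ in range(5000):
    z = K @ alpha + b                          # z_i = alpha^T k_i + b = f(x_i) + b
    s = 1.0 / (1.0 + np.exp(y * z))            # sigmoid(-y_i z_i)
    grad_alpha = -(K @ (y * s)) / m + 2 * lam * (K @ alpha)
    grad_b = -(y * s).sum() / m
    alpha -= eta * grad_alpha
    b -= eta * grad_b

# Predict at new points x via f(x) = sum_j alpha_j K(x_j, x).
X_test = rng.normal(size=(5, 2))
f_test = rbf_kernel(X_test, X) @ alpha + b
print(np.sign(f_test))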
5  Kernel Ridge Regression

Given a training sample S ∈ (X × R)^m and kernel function K : X × X → R, consider first a kernel ridge regression formulation for learning a function f ∈ F_K:

    min_{f∈F_K}   (1/m) Σ_{i=1}^m (y_i − f(x_i))^2  +  λ‖f‖_K^2.

Again, since we know from the Representer Theorem that the solution has the form f(x) = Σ_{i=1}^m α_i K(x_i, x), we can write the above as an optimization problem over α:

    min_{α∈R^m}   (1/m) Σ_{i=1}^m ( y_i − Σ_{j=1}^m α_j K(x_j, x_i) )^2  +  λ Σ_{i=1}^m Σ_{j=1}^m α_i α_j K(x_i, x_j),

or in matrix notation,

    min_{α∈R^m}   (1/m) Σ_{i=1}^m (y_i − α^⊤k_i)^2  +  λ α^⊤Kα.

Again, this is of the same form as standard linear ridge regression, with feature vectors k_i and with regularizer α^⊤Kα rather than ‖α‖_2^2. If K is positive definite, in which case the Gram matrix K is invertible, then setting the gradient of the above objective w.r.t. α to zero can be seen to yield

    α = (K + λm I_m)^{−1} y,

where as before I_m is the m × m identity matrix and y = (y_1, …, y_m)^⊤ ∈ R^m.

Exercise. Show that if X = R^d and one wants to explicitly include a bias term b in the linear ridge regression solution which is not included in the regularization, then defining

    X̃ = [ x_1^⊤ 1 ; … ; x_m^⊤ 1 ] ∈ R^{m×(d+1)},   w̃ = [ w ; b ],   L = [ I_d 0 ; 0^⊤ 0 ],

one gets the solution w̃ = (X̃^⊤X̃ + λmL)^{−1} X̃^⊤y. How would you extend this to learning a function of the form f(x) + b for f ∈ F_K, b ∈ R in the kernel ridge regression setting?
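A minimal sketch of kernel ridge regression using the closed-form solution above, assuming NumPy; the RBF kernel, regularization level, and one-dimensional sine data are illustrative choices.

import numpy as np

def rbf_kernel(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

m, lam = len(y), 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * m * np.eye(m), y)    # alpha = (K + lambda m I)^{-1} y

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
f_test = rbf_kernel(X_test, X) @ alpha                 # f(x) = sum_i alpha_i K(x_i, x)
print(np.round(f_test, 2), np.round(np.sin(X_test[:, 0]), 2))

Note that solving the m × m linear system costs O(m^3) time regardless of the (possibly infinite) dimension of the feature space implemented by the kernel.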