CIS 520: Machine Learning Oct 09, Kernel Methods

Oct 09, 2017
Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
- Non-linear models via basis functions
- Closer look at the SVM dual: kernel functions, kernel SVM
- RKHSs and Representer Theorem
- Kernel logistic regression
- Kernel ridge regression

1 Non-linear Models via Basis Functions

Let $X = \mathbb{R}^d$. We have seen methods for learning linear models of the form $h(x) = \mathrm{sign}(w^\top x + b)$ for binary classification (such as logistic regression and SVMs) and $f(x) = w^\top x + b$ for regression (such as linear least squares regression and SVR). What if we want to learn a non-linear model? What would be a simple way to achieve this using the methods we have seen so far?

One way to achieve this is to map instances $x \in \mathbb{R}^d$ to some new feature vectors $\phi(x) \in \mathbb{R}^n$ via some non-linear feature mapping $\phi : \mathbb{R}^d \to \mathbb{R}^n$, and then to learn a linear model in this transformed space. For example, if one maps instances $x \in \mathbb{R}^d$ to $n = 1 + 2d + \binom{d}{2}$-dimensional feature vectors

\[ \phi(x) = \big( 1,\; x_1, \ldots, x_d,\; x_1 x_2, \ldots, x_{d-1} x_d,\; x_1^2, \ldots, x_d^2 \big)^\top , \]

then learning a linear model in the transformed space is equivalent to learning a quadratic model in the original instance space. In general, one can choose any basis functions $\phi_1, \ldots, \phi_n : X \to \mathbb{R}$, and learn a linear model over these: $w^\top \phi(x) + b$, where $w \in \mathbb{R}^n$ (in fact, one can do this for $X \neq \mathbb{R}^d$ as well). For example, in least squares regression applied to a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^d \times \mathbb{R})^m$, one would simply replace the matrix $X \in \mathbb{R}^{m \times d}$ with the design matrix $\Phi \in \mathbb{R}^{m \times n}$, where $\Phi_{ij} = \phi_j(x_i)$.

What is a potential difficulty in doing this? If $n$ is large (e.g. as would be the case if the feature mapping $\phi$ corresponded to a high-degree polynomial), then the above approach can be computationally expensive. In this lecture we look at a technique that allows one to implement the above idea efficiently for many algorithms. We start by taking a closer look at the SVM dual which we derived in the last lecture.

2 Closer Look at the SVM Dual: Kernel Functions, Kernel SVM

Recall the form of the dual we derived for the (soft-margin) linear SVM:

\[ \max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \, (x_i^\top x_j) + \sum_{i=1}^m \alpha_i \tag{1} \]
\[ \text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0 \tag{2} \]
\[ 0 \le \alpha_i \le C, \quad i = 1, \ldots, m. \tag{3} \]

If we implement this on feature vectors $\phi(x_i) \in \mathbb{R}^n$ in place of $x_i \in \mathbb{R}^d$, we get the following optimization problem:

\[ \max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \, \big( \phi(x_i)^\top \phi(x_j) \big) + \sum_{i=1}^m \alpha_i \tag{4} \]
\[ \text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0 \tag{5} \]
\[ 0 \le \alpha_i \le C, \quad i = 1, \ldots, m. \tag{6} \]

This involves computing dot products between vectors $\phi(x_i), \phi(x_j)$ in $\mathbb{R}^n$. Similarly, using the learned model to make predictions on a new test point $x \in \mathbb{R}^d$ also involves computing dot products between vectors in $\mathbb{R}^n$:

\[ h(x) = \mathrm{sign}\Big( \sum_{i \in SV} \alpha_i y_i \, \phi(x_i)^\top \phi(x) + b \Big). \]

For example, as we saw above, one can learn a quadratic classifier in $X = \mathbb{R}^2$ by learning a linear classifier in $\phi(\mathbb{R}^2) \subset \mathbb{R}^6$, where

\[ \phi\Big( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \Big) = \big( 1,\; x_1,\; x_2,\; x_1 x_2,\; x_1^2,\; x_2^2 \big)^\top ; \]

clearly, a straightforward approach to learning an SVM classifier in this space (and applying it to a new test point) will involve computing dot products in $\mathbb{R}^6$ (more generally, when learning a degree-$q$ polynomial in $\mathbb{R}^d$, such a straightforward approach will involve computing dot products in $\mathbb{R}^n$ for $n = O(d^q)$).
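The explicit feature-map approach described above can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the function names (`quadratic_features`, `design_matrix`, `feature_dimension`) and the sample points are assumptions chosen for the example. It builds the quadratic feature vector in the order used above, checks that its length is $1 + 2d + \binom{d}{2}$, and forms the design matrix $\Phi$ with $\Phi_{ij} = \phi_j(x_i)$.

```python
from itertools import combinations
from math import comb


def quadratic_features(x):
    """Explicit map phi(x) = (1, x_1..x_d, x_i*x_j for i<j, x_1^2..x_d^2)."""
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    squares = [xi * xi for xi in x]
    return [1.0] + [float(xi) for xi in x] + cross + squares


def feature_dimension(d):
    """n = 1 + 2d + C(d, 2), the dimension of the quadratic feature space."""
    return 1 + 2 * d + comb(d, 2)


def design_matrix(xs, feature_map):
    """Row i of Phi is phi(x_i), so Phi[i][j] = phi_j(x_i)."""
    return [feature_map(x) for x in xs]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical sample of m = 3 points in R^2 (made up for illustration).
xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
Phi = design_matrix(xs, quadratic_features)

# Dot products needed by the SVM dual, computed explicitly in R^6.
explicit_gram = [[dot(p, q) for q in Phi] for p in Phi]

print(len(Phi), len(Phi[0]), feature_dimension(2))
```

The cost of each dot product here grows with $n$, which is exactly the difficulty the rest of this section addresses.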

Now, consider replacing dot products $\phi(x)^\top \phi(x')$ in the above example with $K(x, x')$, where for all $x, x' \in \mathbb{R}^2$,

\[ K(x, x') = (x^\top x' + 1)^2 . \]

It can be verified (exercise!) that $K(x, x') = \phi_K(x)^\top \phi_K(x')$, where

\[ \phi_K\Big( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \Big) = \big( 1,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; \sqrt{2}\, x_1 x_2,\; x_1^2,\; x_2^2 \big)^\top . \]

Thus, using $K(x, x')$ above instead of $\phi(x)^\top \phi(x')$ implicitly computes dot products in $\mathbb{R}^6$, with computation of dot products required only in $\mathbb{R}^2$!

In fact, one can use any symmetric, positive semi-definite kernel function $K : X \times X \to \mathbb{R}$ (also called a Mercer kernel function) in the SVM algorithm directly, even if the feature space implemented by the kernel function cannot be described explicitly. Any such kernel function yields a convex dual problem; if $K$ is positive definite, then $K$ also corresponds to inner products in some inner product space $V$ (i.e. $K(x, x') = \langle \phi(x), \phi(x') \rangle$ for some $\phi : X \to V$).

For Euclidean instance spaces $X = \mathbb{R}^d$, examples of commonly used kernel functions include the polynomial kernel $K(x, x') = (x^\top x' + 1)^q$, which results in learning a degree-$q$ polynomial threshold classifier, and the Gaussian kernel, also known as the radial basis function (RBF) kernel, $K(x, x') = \exp\big( -\|x - x'\|_2^2 / (2\sigma^2) \big)$ (where $\sigma > 0$ is a parameter of the kernel), which effectively implements dot products in an infinite-dimensional inner product space; in both cases, evaluating the kernel $K(x, x')$ at any two points $x, x'$ requires only $O(d)$ computation time.

Kernel functions can also be used for non-vectorial data ($X \neq \mathbb{R}^d$); for example, kernel functions are often used to implicitly embed instance spaces containing strings, trees etc. into an inner product space, and to implicitly learn a linear classifier in this space. Intuitively, it is helpful to think of kernel functions as capturing some sort of similarity between pairs of instances in $X$.

To summarize, given a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{\pm 1\})^m$, in order to learn a kernel SVM classifier using a kernel function $K : X \times X \to \mathbb{R}$, one simply solves the kernel SVM dual given by

\[ \max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j \, K(x_i, x_j) + \sum_{i=1}^m \alpha_i \tag{7} \]
\[ \text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0 \tag{8} \]
\[ 0 \le \alpha_i \le C, \quad i = 1, \ldots, m, \tag{9} \]

and then predicts the label of a new instance $x \in X$ according to

\[ h(x) = \mathrm{sign}\Big( \sum_{i \in SV} \alpha_i y_i \, K(x_i, x) + b \Big), \quad \text{where} \quad b = \frac{1}{|SV_1|} \sum_{i \in SV_1} \Big( y_i - \sum_{j \in SV} \alpha_j y_j \, K(x_i, x_j) \Big). \]

Here $SV = \{ i : \alpha_i > 0 \}$ is the set of support vectors and $SV_1 = \{ i : 0 < \alpha_i < C \}$ is the subset lying exactly on the margin, as in the linear soft-margin SVM.
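The identity $K(x, x') = \phi_K(x)^\top \phi_K(x')$ and the kernel SVM prediction rule can be checked numerically with a short sketch. The helper names below (`poly_kernel`, `gaussian_kernel`, `phi_K`, `svm_decision`, `svm_predict`) are assumptions made for this illustration, and the values of $\alpha$ and $b$ in the usage lines are made up rather than obtained by solving the dual (7)-(9).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z, q=2):
    """Polynomial kernel K(x, z) = (x.z + 1)^q, computed in R^d."""
    return (dot(x, z) + 1.0) ** q


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def phi_K(x):
    """Explicit feature map in R^6 matching poly_kernel with q = 2 on R^2."""
    r2 = math.sqrt(2.0)
    x1, x2 = x
    return [1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 * x1, x2 * x2]


def svm_decision(x, support, alphas, labels, b, kernel):
    """f(x) + b = sum_{i in SV} alpha_i y_i K(x_i, x) + b."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, labels, support)) + b


def svm_predict(x, support, alphas, labels, b, kernel):
    return 1 if svm_decision(x, support, alphas, labels, b, kernel) >= 0 else -1


# The kernel value in R^2 agrees with the explicit dot product in R^6.
x, z = [1.0, 2.0], [3.0, -1.0]
implicit = poly_kernel(x, z)
explicit = dot(phi_K(x), phi_K(z))

# Hypothetical support vectors and coefficients (NOT a solved dual).
support = [[0.0, 0.0], [2.0, 2.0]]
alphas, labels, b = [1.0, 1.0], [1, -1], 0.0
label_near_first = svm_predict([0.1, 0.0], support, alphas, labels, b,
                               gaussian_kernel)
```

Only `poly_kernel` and `gaussian_kernel` touch the data at prediction time; `phi_K` is needed solely to confirm the identity.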

3 RKHSs and Representer Theorem

Let $K : X \times X \to \mathbb{R}$ be a symmetric positive definite kernel function. Let

\[ F_K^0 = \Big\{ f : X \to \mathbb{R} \;\Big|\; f(x) = \sum_{i=1}^r \alpha_i K(x_i, x) \text{ for some } r \in \mathbb{Z}_+,\ \alpha_i \in \mathbb{R},\ x_i \in X \Big\} . \]

For $f, g \in F_K^0$ with $f(x) = \sum_{i=1}^r \alpha_i K(x_i, x)$ and $g(x) = \sum_{j=1}^s \beta_j K(x'_j, x)$, define

\[ \langle f, g \rangle_K = \sum_{i=1}^r \sum_{j=1}^s \alpha_i \beta_j K(x_i, x'_j) \tag{10} \]
\[ \|f\|_K = \sqrt{\langle f, f \rangle_K} . \tag{11} \]

Let $F_K$ be the completion of $F_K^0$ under the metric induced by the above norm [1]. Then $F_K$ is called the reproducing kernel Hilbert space (RKHS) associated with $K$ [2].

Note that the SVM classifier learned using kernel $K$ is of the form

\[ h(x) = \mathrm{sign}(f(x) + b), \]

where $f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x)$, i.e. where $f \in F_K$. In fact, consider the following optimization problem:

\[ \min_{f \in F_K,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \big( 1 - y_i (f(x_i) + b) \big)_+ + \lambda \|f\|_K^2 . \]

It turns out that the above SVM solution (with $C = \frac{1}{2 \lambda m}$) is a solution to this problem, i.e. the kernel SVM solution minimizes the RKHS-norm regularized hinge loss over all functions of the form $f(x) + b$ for $f \in F_K$, $b \in \mathbb{R}$.

More generally, we have the following result:

Theorem (Representer Theorem). Let $K : X \times X \to \mathbb{R}$ be a positive definite kernel function. Let $Y \subseteq \mathbb{R}$. Let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$. Let $L : \mathbb{R}^m \times Y^m \to \mathbb{R}$. Let $\Omega : \mathbb{R}_+ \to \mathbb{R}_+$ be a monotonically increasing function. Then for $\lambda > 0$, there is a solution to the optimization problem

\[ \min_{f \in F_K,\, b \in \mathbb{R}} \; L\big( (f(x_1) + b, \ldots, f(x_m) + b),\, (y_1, \ldots, y_m) \big) + \lambda\, \Omega(\|f\|_K^2) \]

of the form

\[ f(x) = \sum_{i=1}^m \alpha_i K(x_i, x) \quad \text{for some } \alpha_1, \ldots, \alpha_m \in \mathbb{R} . \]

If $\Omega$ is strictly increasing, then all solutions have this form.

The above result tells us that even if $F_K$ is an infinite-dimensional space, any optimization problem resulting from minimizing a loss over a finite training sample regularized by some increasing function of the RKHS-norm is effectively a finite-dimensional optimization problem, and moreover, the solution to this problem can be written as a kernel expansion over the training points. In particular, minimizing any other loss over $F_K$ (regularized by the RKHS-norm) will also yield a solution of this form!
Exercise. Show that linear functions $f : \mathbb{R}^d \to \mathbb{R}$ of the form $f(x) = w^\top x$ form an RKHS with linear kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ given by $K(x, x') = x^\top x'$ and with $\|f\|_K^2 = \|w\|_2^2$.

[1] The metric induced by the norm $\|\cdot\|_K$ is given by $d_K(f, g) = \|f - g\|_K$. The completion of $F_K^0$ is simply $F_K^0$ plus any limit points of Cauchy sequences in $F_K^0$ under this metric.

[2] The name reproducing kernel Hilbert space comes from the following reproducing property: For any $x \in X$, define $K_x : X \to \mathbb{R}$ as $K_x(x') = K(x, x')$; then for any $f \in F_K$, we have $\langle f, K_x \rangle_K = f(x)$.
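A numerical check (not a proof) can make the RKHS inner product and the exercise above more concrete. The sketch below assumes the linear kernel on $\mathbb{R}^2$ and a made-up kernel expansion; the names `rkhs_inner`, `points`, and `alpha` are hypothetical. It evaluates the RKHS inner product of two finite kernel expansions, confirms that $f(x) = w^\top x$ with $w = \sum_i \alpha_i x_i$, that $\|f\|_K^2 = \|w\|_2^2$, and that $\langle f, K_x \rangle_K = f(x)$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    return dot(x, z)


def rkhs_inner(alpha, xs, beta, zs, kernel):
    """<f, g>_K for f = sum_i alpha_i K(x_i, .) and g = sum_j beta_j K(z_j, .)."""
    return sum(a * bj * kernel(xi, zj)
               for a, xi in zip(alpha, xs)
               for bj, zj in zip(beta, zs))


# Hypothetical expansion points and coefficients (made up for illustration).
points = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
alpha = [0.5, -1.0, 2.0]


def f(x):
    """f(x) = sum_i alpha_i K(x_i, x)."""
    return sum(a * linear_kernel(p, x) for a, p in zip(alpha, points))


# For the linear kernel, f(x) = w.x with w = sum_i alpha_i x_i.
w = [sum(a * p[k] for a, p in zip(alpha, points)) for k in range(2)]

norm_sq_rkhs = rkhs_inner(alpha, points, alpha, points, linear_kernel)
norm_sq_w = dot(w, w)

# Reproducing property: <f, K_x>_K = f(x), where K_x = 1 * K(x, .).
x_test = [3.0, 4.0]
reproduced = rkhs_inner(alpha, points, [1.0], [x_test], linear_kernel)
```

The same `rkhs_inner` function works for any kernel passed in; only the identification of $f$ with a weight vector $w$ is specific to the linear kernel.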

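4 Kernel Logistic Regression

Given a training sample $S \in (X \times \{\pm 1\})^m$ and kernel function $K : X \times X \to \mathbb{R}$, the kernel logistic regression classifier is given by the solution to the following optimization problem:

\[ \min_{f \in F_K,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\big( 1 + e^{-y_i (f(x_i) + b)} \big) + \lambda \|f\|_K^2 . \]

Since we know from the Representer Theorem that the solution has the form $f(x) = \sum_{i=1}^m \alpha_i K(x_i, x)$, we can write the above as an optimization problem over $\alpha, b$:

\[ \min_{\alpha \in \mathbb{R}^m,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\Big( 1 + e^{-y_i \left( \sum_{j=1}^m \alpha_j K(x_j, x_i) + b \right)} \Big) + \lambda \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(x_i, x_j) . \]

This is of a similar form as in standard logistic regression, with $m$ basis functions $\phi_j(x) = K(x_j, x)$ for $j \in [m]$ (and with $\alpha$ playing the role of $w$)! In particular, define $K \in \mathbb{R}^{m \times m}$ as $K_{ij} = K(x_i, x_j)$ (this is often called the Gram matrix), and let $k_i$ denote the $i$-th column of this matrix. Then we can write the above as simply

\[ \min_{\alpha \in \mathbb{R}^m,\, b \in \mathbb{R}} \; \frac{1}{m} \sum_{i=1}^m \ln\big( 1 + e^{-y_i (\alpha^\top k_i + b)} \big) + \lambda\, \alpha^\top K \alpha , \]

which is similar to the form for standard linear logistic regression (with feature vectors $k_i$) except for the regularizer being $\alpha^\top K \alpha$ rather than $\|\alpha\|_2^2$, and can be solved similarly as before, using similar numerical optimization methods.

We note that unlike SVMs, here in general, the solution has $\alpha_i \neq 0$ for all $i \in [m]$. A variant of logistic regression called the import vector machine (IVM) adopts a greedy approach to find a subset $IV \subseteq [m]$ such that the function $f'(x) + b = \sum_{i \in IV} \alpha_i K(x_i, x) + b$ gives good performance. Compared to SVMs, IVMs can provide more natural class probability estimates, as well as more natural extensions to multiclass classification.

5 Kernel Ridge Regression

Given a training sample $S \in (X \times \mathbb{R})^m$ and kernel function $K : X \times X \to \mathbb{R}$, consider first a kernel ridge regression formulation for learning a function $f \in F_K$:

\[ \min_{f \in F_K} \; \frac{1}{m} \sum_{i=1}^m \big( y_i - f(x_i) \big)^2 + \lambda \|f\|_K^2 . \]

Again, since we know from the Representer Theorem that the solution has the form $f(x) = \sum_{i=1}^m \alpha_i K(x_i, x)$, we can write the above as an optimization problem over $\alpha$:

\[ \min_{\alpha \in \mathbb{R}^m} \; \frac{1}{m} \sum_{i=1}^m \Big( y_i - \sum_{j=1}^m \alpha_j K(x_j, x_i) \Big)^2 + \lambda \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j K(x_i, x_j) , \]

or in matrix notation,

\[ \min_{\alpha \in \mathbb{R}^m} \; \frac{1}{m} \sum_{i=1}^m \big( y_i - \alpha^\top k_i \big)^2 + \lambda\, \alpha^\top K \alpha . \]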
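Both objectives above are expressed entirely through the Gram matrix. For the kernel logistic regression objective of Section 4, one possible numerical approach is plain gradient descent on $(\alpha, b)$. The sketch below is a minimal illustration under stated assumptions: the function names, the Gaussian kernel with $\sigma = 0.5$, the step size, the iteration count, and the XOR-style toy data are all choices made here, not taken from the lecture.

```python
import math


def gaussian_kernel(x, z, sigma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(xs, kernel):
    """K[i][j] = K(x_i, x_j)."""
    return [[kernel(a, b) for b in xs] for a in xs]


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def log1pexp(t):
    """Numerically stable ln(1 + e^t)."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))


def klr_objective(alpha, b, K, ys, lam):
    """(1/m) sum_i ln(1 + exp(-y_i (alpha.k_i + b))) + lam * alpha^T K alpha."""
    m = len(ys)
    Ka = [sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    loss = sum(log1pexp(-ys[i] * (Ka[i] + b)) for i in range(m)) / m
    return loss + lam * sum(alpha[i] * Ka[i] for i in range(m))


def fit_klr(K, ys, lam=0.01, lr=0.5, steps=500):
    """Gradient descent on (alpha, b); lr and steps are illustrative choices."""
    m = len(ys)
    alpha, b = [0.0] * m, 0.0
    for _ in range(steps):
        Ka = [sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        # Derivative of the i-th loss term with respect to z_i = alpha.k_i + b.
        g = [-ys[i] * sigmoid(-ys[i] * (Ka[i] + b)) for i in range(m)]
        # Gradient in alpha: (1/m) K g + 2 lam K alpha (K is symmetric).
        grad_alpha = [sum(K[i][j] * (g[j] / m + 2.0 * lam * alpha[j])
                          for j in range(m)) for i in range(m)]
        grad_b = sum(g) / m
        alpha = [alpha[i] - lr * grad_alpha[i] for i in range(m)]
        b -= lr * grad_b
    return alpha, b


def klr_predict(x, xs, alpha, b, kernel):
    score = sum(a * kernel(xi, x) for a, xi in zip(alpha, xs)) + b
    return 1 if score >= 0 else -1


# XOR-style toy data (made up): not linearly separable in R^2.
xs = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
ys = [1, 1, -1, -1]
K = gram_matrix(xs, gaussian_kernel)
alpha, b = fit_klr(K, ys)
predictions = [klr_predict(x, xs, alpha, b, gaussian_kernel) for x in xs]
```

On this toy sample every coefficient $\alpha_i$ ends up non-zero, in line with the remark that kernel logistic regression solutions are generally not sparse.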

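Again, the kernel ridge regression objective is of the same form as standard linear ridge regression, with feature vectors $k_i$ and with regularizer $\alpha^\top K \alpha$ rather than $\|\alpha\|_2^2$. If $K$ is positive definite, in which case the Gram matrix $K$ is invertible, then setting the gradient of the objective above with respect to $\alpha$ to zero can be seen to yield

\[ \alpha = \big( K + \lambda m I_m \big)^{-1} y , \]

where as before $I_m$ is the $m \times m$ identity matrix and $y = (y_1, \ldots, y_m)^\top \in \mathbb{R}^m$.

Exercise. Show that if $X = \mathbb{R}^d$ and one wants to explicitly include a bias term $b$ in the linear ridge regression solution which is not included in the regularization, then defining

\[ \widetilde{X} = \begin{pmatrix} x_1^\top & 1 \\ \vdots & \vdots \\ x_m^\top & 1 \end{pmatrix}, \qquad \widetilde{w} = \begin{pmatrix} w \\ b \end{pmatrix}, \qquad L = \begin{pmatrix} I_d & 0 \\ 0 & 0 \end{pmatrix}, \]

one gets the solution

\[ \widetilde{w} = \big( \widetilde{X}^\top \widetilde{X} + \lambda m L \big)^{-1} \widetilde{X}^\top y . \]

How would you extend this to learning a function of the form $f(x) + b$ for $f \in F_K$, $b \in \mathbb{R}$ in the kernel ridge regression setting?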
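The closed-form solution $\alpha = (K + \lambda m I_m)^{-1} y$ (without a bias term) can be sketched as follows. The function names, the Gaussian kernel with $\sigma = 1$, the value of $\lambda$, and the toy data are assumptions for illustration; the small Gaussian-elimination routine stands in for whatever linear solver one would normally use from a numerical library.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(xs, kernel):
    return [[kernel(a, b) for b in xs] for a in xs]


def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_krr(xs, ys, kernel, lam):
    """alpha = (K + lam * m * I)^(-1) y."""
    m = len(xs)
    K = gram_matrix(xs, kernel)
    A = [[K[i][j] + (lam * m if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    return solve_linear_system(A, ys)


def krr_predict(x, xs, alpha, kernel):
    """f(x) = sum_i alpha_i K(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, xs))


# Toy 1-D regression sample (made up for illustration).
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [math.sin(x[0]) for x in xs]
lam = 1e-3
alpha = fit_krr(xs, ys, gaussian_kernel, lam)
fitted = [krr_predict(x, xs, alpha, gaussian_kernel) for x in xs]
```

Because $(K + \lambda m I_m)\alpha = y$, the training residuals satisfy $y_i - f(x_i) = \lambda m\, \alpha_i$, which gives a quick sanity check on any implementation.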
