Kernel Methods Machine Learning A 708.063 07W VO

Outline
1. Dual representation
2. The kernel concept
3. Properties of kernels
4. Examples of kernel machines: Kernel PCA, Support vector regression, (Relevance vector machine)

Introduction Many linear parametric models can be re-cast into an equivalent dual representation in which the prediction is based on linear combinations of kernel functions evaluated at the training points. The kernel concept was introduced into the field of pattern recognition by Aizerman et al. (1964). It was reintroduced into machine learning in the context of large margin classifiers by Boser et al. (1992), giving rise to the technique of Support Vector Machines.

Dual representation Models which are based on a fixed nonlinear feature space mapping $\Phi : X \to H$ can be reformulated in terms of a dual representation in which the kernel function arises naturally. Consider the linear (regression) model
$$y(x_n) = w^T \Phi(x_n), \qquad x_n \in X,\; w \in \mathbb{R}^M,\; H \subseteq \mathbb{R}^M,$$
with a regularized sum-of-squares error function
$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \bigl(w^T \Phi(x_n) - t_n\bigr)^2 + \frac{\lambda}{2}\, w^T w,$$
whose minimizer is
$$w_{\mathrm{opt}} = \bigl(\Phi^T \Phi + \lambda I_M\bigr)^{-1} \Phi^T t, \qquad t = (t_1, \ldots, t_N)^T,$$
where $\Phi \in \mathbb{R}^{N \times M}$ is the design matrix.

Gram matrix Introduce a kernel function $k(x, x') = \Phi(x)^T \Phi(x')$ and the Gram matrix $K$, defined as
$$K = \Phi \Phi^T, \qquad K_{nm} = k(x_n, x_m) = \Phi(x_n)^T \Phi(x_m),$$
where $\Phi$ denotes the design matrix with rows $\Phi(x_n)^T$.

Dual formulation The dual formulation allows the solution of the least-squares problem to be expressed entirely in terms of the kernel function, in an $N$-dimensional space:
$$y(x) = w^T \Phi(x) = a^T k(x), \qquad k_n(x) = k(x_n, x), \qquad a = (K + \lambda I_N)^{-1} t, \qquad t = (t_1, \ldots, t_N)^T.$$
It is known as the dual formulation because, by noting that $a$ can in turn be expressed in terms of $\Phi(x)$, we recover the original formulation (for $t = K t'/\mathrm{const}$ and $\lambda = \lambda'/\mathrm{const}$ one obtains the same $E_{\mathrm{SSE}}$ in kernel space).
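To make the dual solution concrete, here is a minimal numerical sketch of kernel ridge regression (the function names, the toy data, and the use of NumPy are my own illustrative choices, not part of the lecture); it also checks that the dual and primal solutions give identical predictions for the linear kernel.

```python
import numpy as np

def linear_kernel(X1, X2):
    """k(x, x') = x^T x', evaluated for all pairs of rows."""
    return X1 @ X2.T

def dual_ridge_fit(X, t, lam, kernel=linear_kernel):
    """Dual (kernel) ridge regression: a = (K + lambda I_N)^{-1} t."""
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(X.shape[0]), t)

def dual_ridge_predict(X_train, a, X_test, kernel=linear_kernel):
    """y(x) = a^T k(x), with k_n(x) = k(x_n, x)."""
    return kernel(X_test, X_train) @ a

# Sanity check against the primal solution w_opt = (Phi^T Phi + lambda I_M)^{-1} Phi^T t,
# using the identity feature map Phi(x) = x.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1

a = dual_ridge_fit(X, t, lam)
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ t)
assert np.allclose(dual_ridge_predict(X, a, X), X @ w)   # both formulations agree
```

Note that the dual solution requires inverting an $N \times N$ matrix rather than an $M \times M$ one, which only pays off once the kernel replaces an expensive or infinite-dimensional feature map.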

The kernel concept In learning we want to be able to generalize to unseen data points. For training data $(x_1, y_1), \ldots, (x_N, y_N) \in X \times Y$ and a new test sample $x$ we want to choose $y$ so that $(x, y)$ is in some sense similar to the training samples. Therefore we need notions of similarity: for the outputs $y$, an error or loss function; for the inputs $x$, a symmetric kernel function
$$k : X \times X \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x').$$

Similarities in the input space We further focus on a simple similarity measure: the dot product, i.e. the canonical dot product $x^T x'$. But $X$ may not be a dot-product space. We therefore introduce a mapping
$$\Phi : X \to H, \qquad x \mapsto \mathbf{x} := \Phi(x)$$
into a dot product space $H$. Benefits: 1. $k(x, x') := \langle \Phi(x), \Phi(x') \rangle$ defines a similarity measure on $X$. 2. We can deal with the patterns geometrically. 3. Freedom to choose $\Phi$.

Remark: Linear models Linear models can be completely formulated in terms of dot products:
$$y(x) = \mathrm{sign}\bigl(\langle w, x \rangle + b\bigr).$$

Questions What is the benefit of feature maps $\Phi$? What is the relationship between kernels and feature maps? What are the advantages and disadvantages of the kernel approach?

Benefit of nonlinear feature maps Data that are not linearly separable in the input space can become linearly separable after a nonlinear feature map; for example, 2D points labeled by whether they lie inside a circle can be classified in 3D with a single hyperplane.
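As a small illustrative sketch (the circular toy data and the quadratic map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ are my own choices, not the lecture's figure): points that are not linearly separable in 2D become separable by a single plane after the map, whose induced kernel is $\langle \Phi(x), \Phi(x') \rangle = (x^T x')^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)    # label: inside / outside a circle

def phi(X):
    """Quadratic feature map: 2D input -> 3D feature space."""
    return np.column_stack([X[:, 0]**2,
                            np.sqrt(2) * X[:, 0] * X[:, 1],
                            X[:, 1]**2])

Z = phi(X)
# In feature space the classes are separated by the plane z_1 + z_3 = 0.5,
# i.e. the linear classifier w = (1, 0, 1), b = -0.5 is exact on this data.
pred = (Z[:, 0] + Z[:, 2] < 0.5).astype(int)
print("accuracy of a single hyperplane in feature space:", (pred == y).mean())  # 1.0
```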

Relationship between kernels and feature maps Question: What kind of kernels $k(x, x')$ admit a representation as a dot product in a feature space? I.e., given a kernel $k(x, x')$, can we always construct a dot product space $H$ and a map $\Phi$ mapping into it? And whenever we have a map $\Phi$ into a dot product space $H$, can we always construct a kernel $k(x, x')$? Answer: Yes; a map $\Phi$ always yields a kernel, and a kernel admits such a representation if it is positive semidefinite.

Positive semidefinite kernels Definition (positive semidefinite matrix): A complex $m \times m$ matrix $K$ satisfying
$$\sum_{i,j} c_i \bar{c}_j K_{ij} \ge 0 \quad \text{for all } c_i \in \mathbb{C}$$
is called positive semidefinite. Similarly, a real symmetric $m \times m$ matrix $K$ satisfying this relation for all $c_i \in \mathbb{R}$ is called positive semidefinite. Definition (positive semidefinite kernel): Let $X$ be a nonempty set. A function $k$ on $X \times X$ which for all $N \in \mathbb{N}$ and all $x_1, \ldots, x_N \in X$ gives rise to a positive semidefinite Gram matrix is called a positive semidefinite kernel. The definitions for positive semidefinite kernels and positive semidefinite matrices differ in the fact that in the former case we are free to choose the points on which the kernel is evaluated.

Properties of the Gram matrix 1. Positivity on the diagonal: $K_{ii} = k(x_i, x_i) \ge 0$ for all $x_i \in X$. 2. Symmetry: $K_{ij} = k(x_i, x_j) = k(x_j, x_i) = K_{ji}$. 3. All eigenvalues are nonnegative. 4. The Cauchy-Schwarz inequality is fulfilled: $K_{ij}^2 \le K_{ii} K_{jj}$. Proof (of 4, in the $2 \times 2$ case): the eigenvalues of $K$ are nonnegative and so is their product, the determinant:
$$0 \le \det K = K_{11} K_{22} - K_{12} K_{21} = K_{11} K_{22} - K_{12}^2.$$
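These properties are easy to verify numerically for a concrete Gram matrix; a short sketch (the Gaussian kernel, its bandwidth, and the random data are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.default_rng(2).normal(size=(20, 4))
K = gaussian_kernel(X, X)

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())      # >= 0 up to round-off
# Cauchy-Schwarz: K_ij^2 <= K_ii K_jj for all pairs i, j
print("Cauchy-Schwarz holds:",
      np.all(K**2 <= np.outer(np.diag(K), np.diag(K)) + 1e-12))
```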

Kernels from feature maps Whenever we have a map $\Phi$ into a dot product space $H$, we obtain a positive semidefinite kernel via $k(x, x') := \langle \Phi(x), \Phi(x') \rangle$: for all $c_i \in \mathbb{R}$, $x_i \in X$, $i = 1, \ldots, m$,
$$\sum_{i,j} c_i c_j k(x_i, x_j) = \Bigl\langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j) \Bigr\rangle = \Bigl\| \sum_i c_i \Phi(x_i) \Bigr\|^2 \ge 0.$$

Feature maps from kernels Define a map from $X$ into the space of functions mapping $X$ into $\mathbb{R}$, $\mathbb{R}^X := \{ f : X \to \mathbb{R} \}$:
$$\Phi : X \to \mathbb{R}^X, \qquad x \mapsto k(\cdot, x).$$
We construct a feature space as follows: 1. Turn the image of $\Phi$ into a vector space. 2. Define a dot product. 3. Show that the dot product satisfies $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. 4. For convenience, turn the dot product space into a Hilbert space.

1. Create a vector space First we define a vector space by taking linear combinations of the form
$$f(\cdot) = \sum_{i=1}^{N} \alpha_i k(\cdot, x_i), \qquad N \in \mathbb{N},\; \alpha_i \in \mathbb{R},\; x_1, \ldots, x_N \in X,\; f : X \to \mathbb{R}.$$

2. Define a dot product Next we define a dot product between $f$ and another function
$$g(\cdot) = \sum_{j=1}^{N'} \beta_j k(\cdot, x_j'), \qquad N' \in \mathbb{N},\; \beta_j \in \mathbb{R},\; x_1', \ldots, x_{N'}' \in X,$$
as
$$\langle f, g \rangle = \sum_{i=1}^{N} \sum_{j=1}^{N'} \alpha_i \beta_j \, k(x_i, x_j').$$
Note that the expansion coefficients need not be unique.

Properties of a dot product 1. A dot product $\langle \cdot, \cdot \rangle : H \times H \to \mathbb{R}$, $(x, x') \mapsto \langle x, x' \rangle$, is a symmetric bilinear form:
$$\langle a x + b x', x'' \rangle = a \langle x, x'' \rangle + b \langle x', x'' \rangle, \qquad \langle x'', a x + b x' \rangle = a \langle x'', x \rangle + b \langle x'', x' \rangle.$$
Proof (for the dot product defined above):
$$\langle f, g \rangle = \sum_{i=1}^{N} \sum_{j=1}^{N'} \alpha_i \beta_j \, k(x_i, x_j') = \langle g, f \rangle, \qquad \langle f, g \rangle = \sum_{j=1}^{N'} \beta_j f(x_j') = \sum_{i=1}^{N} \alpha_i g(x_i).$$

Properties of a dot product 2. A dot product is positive semidefinite, i.e. $\langle f, f \rangle \ge 0$, with equality only for $f = 0$. Proof: From the definition of the dot product and the positive semidefiniteness of the kernel it follows that
$$\langle f, f \rangle = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, k(x_i, x_j) \ge 0,$$
and from the Cauchy-Schwarz inequality for kernels it follows that
$$|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \le k(x, x)\, \langle f, f \rangle,$$
so $\langle f, f \rangle = 0$ implies $f = 0$.

3. Kernel is the dot product The kernel $k$ is the representer of evaluation,
$$\langle f, k(\cdot, x) \rangle = \sum_{i=1}^{N} \alpha_i k(x_i, x) = f(x), \qquad \text{in particular} \qquad \langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x'),$$
and is therefore also called a reproducing kernel. Therefore we get $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$.
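The reproducing property can be checked numerically by representing functions through their expansion points and coefficients; a minimal sketch (the Gaussian kernel and all variable names are illustrative assumptions):

```python
import numpy as np

def k(x, xp, sigma=1.0):
    """Gaussian kernel on scalar inputs."""
    return np.exp(-(x - xp)**2 / (2 * sigma**2))

# f = sum_i alpha_i k(., x_i)
x_f, alpha = np.array([0.0, 1.0, 2.0]), np.array([0.5, -1.0, 2.0])

def dot(xa, a, xb, b):
    """<f, g> = sum_i sum_j a_i b_j k(x_i, x_j')."""
    return a @ k(xa[:, None], xb[None, :]) @ b

def f(x):
    return (alpha * k(x_f, x)).sum()

x0, x1 = 0.7, 1.3
# <f, k(., x0)> = f(x0)
print(dot(x_f, alpha, np.array([x0]), np.array([1.0])), f(x0))
# <k(., x0), k(., x1)> = k(x0, x1)
print(dot(np.array([x0]), np.array([1.0]), np.array([x1]), np.array([1.0])), k(x0, x1))
```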

4. RKHS We finally turn the dot product space into a Hilbert space by a fairly simple mathematical trick, resulting in a Reproducing Kernel Hilbert Space (RKHS): we complete the dot product space in the norm $\|f\| = \sqrt{\langle f, f \rangle}$ by adding the limit points of Cauchy sequences that are convergent in that norm (a Cauchy sequence is a sequence $(f_n)$ with $\|f_n - f_m\| \to 0$ as $n, m \to \infty$). Reason: this has some mathematical advantages, e.g. it is always possible to define projections.

Mercer theorem Mercer's theorem is the traditional way to introduce the kernel trick. Mercer's theorem uses the $L_2$ norm, in contrast to the RKHS norm. But any two separable Hilbert spaces are isometrically isomorphic, i.e. it is possible to define a one-to-one linear map between the spaces which preserves the dot product.

Mercer theorem 1. Mercer kernels are positive definite kernels; therefore they are also reproducing kernels. 2. Different feature spaces can be constructed for the same kernel. 3. As long as only dot products are considered, these spaces can be regarded as identical. 4. In practice we never make explicit use of the RKHS or Mercer maps, but only deal with kernel functions.

Representer theorem Theorem (Representer theorem): Denote by $\Omega : [0, \infty) \to \mathbb{R}$ a strictly monotonically increasing function, by $X$ a set, and by $c : (X \times \mathbb{R}^2)^N \to \mathbb{R} \cup \{\infty\}$ an arbitrary loss function. Then each minimizer $f \in H$ of the regularized risk
$$c\bigl((x_1, y_1, f(x_1)), \ldots, (x_N, y_N, f(x_N))\bigr) + \Omega\bigl(\|f\|_H\bigr)$$
admits a representation of the form
$$f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x).$$
Although we might be trying to solve an optimization problem in an infinite-dimensional space $H$, containing linear combinations of kernels centered on arbitrary points of $X$, the solution lies in the span of $N$ particular kernels: those centered on the training points.

Representer theorem Proof: We decompose any $f$ into a part contained in the span of the kernels centered on the training samples and a part $f_\perp$ that is orthogonal to this span:
$$f(x) = f_\parallel(x) + f_\perp(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x) + f_\perp(x), \qquad \langle f_\perp, k(x_j, \cdot) \rangle = 0 \;\; \forall j \in \{1, \ldots, N\}.$$
At every training point the orthogonal part does not contribute,
$$f(x_j) = \langle f, k(x_j, \cdot) \rangle = \sum_{i=1}^{N} \alpha_i k(x_i, x_j) + \langle f_\perp, k(x_j, \cdot) \rangle = \sum_{i=1}^{N} \alpha_i k(x_i, x_j),$$
so the loss term is unaffected by $f_\perp$, while
$$\|f\|^2 = \Bigl\| \sum_i \alpha_i k(x_i, \cdot) \Bigr\|^2 + \|f_\perp\|^2 \ge \Bigl\| \sum_i \alpha_i k(x_i, \cdot) \Bigr\|^2.$$
Therefore the regularized risk is minimized with $f_\perp = 0$.

Examples of kernels Polynomial kernels, inhomogeneous polynomial kernels, Gaussian kernels, and the sigmoid kernel (not p.s.d.). Prior knowledge of the problem helps in designing the right kernel for sophisticated problems (e.g. bioinformatics, text categorization, etc.).
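For concreteness, a sketch of the usual forms of these kernels (the parameterizations below follow standard conventions and are not taken verbatim from the slide):

```python
import numpy as np

def polynomial(X1, X2, d=3):
    """Homogeneous polynomial kernel: k(x, x') = (x^T x')^d."""
    return (X1 @ X2.T) ** d

def inhomogeneous_polynomial(X1, X2, d=3, c=1.0):
    """Inhomogeneous polynomial kernel: k(x, x') = (x^T x' + c)^d."""
    return (X1 @ X2.T + c) ** d

def gaussian(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def sigmoid(X1, X2, kappa=1.0, theta=-1.0):
    """Sigmoid kernel: k(x, x') = tanh(kappa x^T x' + theta); not p.s.d. in general."""
    return np.tanh(kappa * (X1 @ X2.T) + theta)
```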

Advantages of the kernel approach 1. Kernel trick 1: simple computation of the dot product in a potentially infinite-dimensional feature space (see the RBF kernel) by means of the kernel function. 2. Kernel trick 2: given an algorithm formulated in terms of a p.s.d. kernel $k$, one can formulate another algorithm by replacing $k$ with another p.s.d. kernel. 3. Simple construction of p.s.d. kernels from other p.s.d. kernels $k_1$, $k_2$:
$$k(x, x') = k_1(x, x') + k_2(x, x'), \qquad k(x, x') = k_1(x, x')\, k_2(x, x'), \qquad k(x, x') = \exp\bigl(k_1(x, x')\bigr),$$
$$k(x, x') = f(x)\, k_1(x, x')\, f(x'), \qquad k(x, x') = c\, k_1(x, x'),$$
where $f$ is any function and $c \ge 0$.

Disadvantages of the kernel approach In the dual formulation the solution of a regression model is obtained by inverting an $N \times N$ matrix, which for standard methods requires $O(N^3)$ operations. This limits the number of training samples to fewer than about 10000. Moreover, predicting the output for a new test sample requires the evaluation of $N$ kernel functions. A solution to the latter problem are sparse kernel machines, for which the prediction only depends on a subset of the training data points.

Summary Kernel: a similarity measure of inputs that is computed as a dot product in a high-dimensional feature space. Linear models for regression and classification can be formulated purely in terms of dot products. Any p.s.d. kernel corresponds to a dot product in some feature space. Kernel trick: compute high-dimensional dot products without ever computing the feature map.

Examples of kernel machines Non-sparse kernel machines: Gaussian processes, nonlinear (kernel) PCA. Sparse kernel machines: support vector machines, support vector regression, relevance vector machines.

Nonlinear PCA Linear Principal Component Analysis (PCA): linear PCA is an orthogonal transformation of the coordinate system. The new coordinate system is obtained by projecting the data onto the principal components, the orthogonal axes in the directions of largest variance.

Kernel PCA Nonlinear PCA is its generalization to nonlinear transformations.

Linear PCA Assume that the observations $x_n \in \mathbb{R}^d$ are centered, i.e. have mean 0. PCA finds the principal axes by diagonalizing the covariance matrix
$$C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T.$$
$C$ is positive semidefinite, and can thus be diagonalized with nonnegative eigenvalues. This is done by solving the eigenvalue equation
$$\lambda_i v_i = C v_i, \qquad i = 1, \ldots, d,$$
for eigenvalues $\lambda_1 \ge \ldots \ge \lambda_d \ge 0$ and nonzero eigenvectors $v_i \in \mathbb{R}^d \setminus \{0\}$.

Kernelizing PCA All eigenvectors $v_i$ with $\lambda_i \ne 0$ lie in the span of $x_1, \ldots, x_N$:
$$\lambda_i v_i = C v_i = \frac{1}{N} \sum_{j=1}^{N} \langle x_j, v_i \rangle\, x_j \quad \Longrightarrow \quad v_i = \sum_{j=1}^{N} \alpha_j^i x_j.$$
The eigenvalue equation is therefore equivalent to
$$\lambda_i \langle x_j, v_i \rangle = \langle x_j, C v_i \rangle \qquad \text{for all } j = 1, \ldots, N.$$
Substituting the expansion of $v_i$ into this equation one obtains the eigenvalue problem for the $\alpha^i$:
$$N \lambda_i \alpha^i = K \alpha^i, \qquad \alpha^i = (\alpha_1^i, \ldots, \alpha_N^i)^T, \qquad i = 1, \ldots, N.$$

Kernelizing PCA Requiring $v_i$ to be normalized, $\langle v_i, v_i \rangle = 1$, we obtain the condition $\lambda_i N \langle \alpha^i, \alpha^i \rangle = 1$. Projections onto the principal axes are obtained via
$$\langle v_i, x \rangle = \sum_{j=1}^{N} \alpha_j^i \langle x_j, x \rangle.$$
Because only dot products are involved, they can be replaced by nonlinear kernels, corresponding to linear PCA in a high-dimensional feature space.

Kernel PCA Algorithm: 1. Calculate the Gram matrix $K$ and diagonalize it to obtain its eigenvalues $N\lambda_i$ and eigenvectors $\alpha^i$. 2. Normalize the principal axes $v_i$ by imposing $\lambda_i N \langle \alpha^i, \alpha^i \rangle = 1$. 3. Extract the principal components by projecting onto the principal axes:
$$\langle v_i, \Phi(x) \rangle = \sum_{j=1}^{N} \alpha_j^i \, k(x_j, x), \qquad i = 1, \ldots, N.$$

Centering For the sake of simplicity we have assumed that the data points in the feature space are centered. Centering in the feature space is not as easy as in the input space, but can be achieved by
$$\tilde{K}_{ij} = \bigl(K - 1_N K - K 1_N + 1_N K 1_N\bigr)_{ij},$$
where $1_N$ denotes the $N \times N$ matrix with all elements equal to $1/N$.
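Putting the algorithm and the centering formula together, a minimal kernel PCA sketch (the Gaussian kernel, the concentric-rings toy data, and all names are my own illustrative choices):

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA from a precomputed N x N Gram matrix K."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    Kc = K - one_N @ K - K @ one_N + one_N @ K @ one_N     # centering in feature space

    eigvals, eigvecs = np.linalg.eigh(Kc)                   # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # K alpha = (N lambda) alpha

    # Enforce lambda_i N <alpha^i, alpha^i> = 1: divide unit eigenvectors by sqrt(N lambda_i)
    A = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    return A, Kc

# Toy data: two concentric rings
rng = np.random.default_rng(3)
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.r_[np.full(50, 1.0), np.full(50, 3.0)] + 0.1 * rng.normal(size=100)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)                                 # Gaussian kernel, sigma = 1

A, Kc = kernel_pca(K, n_components=2)
projections = Kc @ A    # <v_i, Phi(x_n)> = sum_j alpha_j^i k~(x_j, x_n), one row per point
```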

Example: contour lines (constant principal component values) of the first three principal components $v_1$, $v_2$, $v_3$ for polynomial kernels $k = (x^T x')^d$ with $d = 1, 2, 3, 4$.

Contour lines for RBF kernels

Support vector regression Extension of Support Vector Machines (SVM) to regression problems while preserving the property of sparseness. Basic idea behind SVMs: Maximize the margin defined by the support vectors in a feature space H

SVM revisited Maximize the distance of the closest data point to the hyperplane,
$$\frac{t_n\, y(x_n)}{\|w\|} = \frac{t_n \bigl(w^T \Phi(x_n) + b\bigr)}{\|w\|},$$
by solving
$$\arg\max_{w, b} \Bigl\{ \frac{1}{\|w\|} \min_n \bigl[ t_n \bigl(w^T \Phi(x_n) + b\bigr) \bigr] \Bigr\}.$$

The canonical representation The optimization problem
$$\arg\max_{w, b} \Bigl\{ \frac{1}{\|w\|} \min_n \bigl[ t_n \bigl(w^T \Phi(x_n) + b\bigr) \bigr] \Bigr\}$$
can be reformulated by rescaling $w \to \kappa w$, $b \to \kappa b$, which leaves the distance of a data point to the hyperplane unchanged, and imposing the constraint
$$t_n \bigl(w^T \Phi(x_n) + b\bigr) \ge 1, \qquad \text{with equality only for support vectors.}$$
The resulting constrained minimization problem,
$$\arg\max_{w, b} \frac{1}{\|w\|} \quad \Longleftrightarrow \quad \arg\min_{w, b} \frac{1}{2}\|w\|^2,$$
is solved by quadratic programming.

Overlapping class distributions For overlapping class distributions we define slack variables
$$\xi_n = \begin{cases} 0, & \text{if } x_n \text{ is on the correct side of the margin}, \\ |y(x_n) - t_n|, & \text{otherwise}, \end{cases}$$
and perform the constrained optimization
$$t_n \bigl(w^T \Phi(x_n) + b\bigr) \ge 1 - \xi_n, \qquad \text{with equality only for support vectors,}$$
$$\arg\min_{w, b} E_{\mathrm{reg}}(w), \qquad E_{\mathrm{reg}}(w) = C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|w\|^2.$$
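As a usage sketch, the soft-margin problem with a kernel can be solved with an off-the-shelf quadratic-programming-based solver; here scikit-learn's SVC is used purely for illustration (the library choice, the toy data, and the parameter values are my assumptions, not the lecture's), with C playing the role of the slack penalty in $E_{\mathrm{reg}}$ above.

```python
import numpy as np
from sklearn.svm import SVC

# Non-separable toy data: two overlapping Gaussian blobs with labels -1 / +1
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=-1.0, size=(100, 2)),
               rng.normal(loc=+1.0, size=(100, 2))])
t = np.r_[np.full(100, -1), np.full(100, +1)]

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)    # C = slack penalty, Gaussian kernel
clf.fit(X, t)

# Only the support vectors (points on, inside, or beyond the margin) enter the solution.
print("support vectors:", clf.support_vectors_.shape[0], "of", len(X))
print("training accuracy:", clf.score(X, t))
```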

Support vector regression Basic idea behind Support Vector Regression (SVR): for SVM, support vectors are data points on or within the margin, or data points on the wrong side of the hyperplane. For SVR, support vectors are data points far away from the target value.

Regularized error function For support vector regression we introduce an $\epsilon$-insensitive error function
$$E_\epsilon\bigl(y(x) - t\bigr) = \begin{cases} 0, & \text{if } |y(x) - t| < \epsilon, \\ |y(x) - t| - \epsilon, & \text{otherwise}, \end{cases}$$
and minimize the regularized error function obtained by replacing $E_{\mathrm{SSE}}$ with $E_\epsilon$:
$$E_{\mathrm{reg}} = C \sum_{n=1}^{N} E_\epsilon\bigl(y(x_n) - t_n\bigr) + \frac{1}{2}\|w\|^2.$$

Points outside the $\epsilon$-tube For points outside the $\epsilon$-tube we define two kinds of slack variables,
$$\xi_n = \begin{cases} t_n - y(x_n) - \epsilon, & \text{if the point lies above the tube}, \\ 0, & \text{otherwise}, \end{cases} \qquad \hat{\xi}_n = \begin{cases} y(x_n) - t_n - \epsilon, & \text{if the point lies below the tube}, \\ 0, & \text{otherwise}, \end{cases}$$
so that
$$E_{\mathrm{reg}} = C \sum_{n=1}^{N} E_\epsilon\bigl(y(x_n) - t_n\bigr) + \frac{1}{2}\|w\|^2 = C \sum_{n=1}^{N} \bigl(\xi_n + \hat{\xi}_n\bigr) + \frac{1}{2}\|w\|^2.$$

Reminder: Lagrangian multipliers To solve the optimization problem of maximizing a function $f(x)$ under the constraint $g(x) = 0$, one introduces a Lagrangian multiplier $\lambda$ and maximizes the Lagrangian function
$$L(x, \lambda) = f(x) + \lambda g(x), \qquad \nabla_x L = \nabla f + \lambda \nabla g = 0, \qquad \partial_\lambda L = g = 0.$$
At the solution the two gradients of $f$ and $g$ must be parallel, and a parameter $\lambda$ must exist such that they cancel.

Reminder: Lagrangian multipliers To solve the optimization problem of maximizing a function $f(x)$ under the constraint $g(x) \ge 0$, one distinguishes two cases: 1. The maximum of $f$ lies within the region $g(x) > 0$: then $\nabla f = 0$ and $\lambda = 0$ (inactive constraint). 2. The maximum of $f$ lies on the boundary $g(x) = 0$: then $\lambda > 0$ and $\nabla f = -\lambda \nabla g$ (active constraint). The solution is therefore obtained by maximizing $L$ subject to the Karush-Kuhn-Tucker (KKT) conditions:
$$\nabla_x L = \nabla f + \lambda \nabla g = 0, \qquad g(x) \ge 0, \qquad \lambda \ge 0, \qquad \lambda\, g(x) = 0.$$
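A small worked example of these conditions (my own illustration, not from the slides): maximize $f(x) = 1 - x_1^2 - x_2^2$ subject to $g(x) = x_1 + x_2 - 1 \ge 0$. The Lagrangian and its stationarity condition are
$$L(x, \lambda) = 1 - x_1^2 - x_2^2 + \lambda\,(x_1 + x_2 - 1), \qquad \nabla_x L = 0 \;\Rightarrow\; x_1 = x_2 = \tfrac{\lambda}{2}.$$
The unconstrained maximum $x = (0, 0)$ violates $g(x) \ge 0$, so the constraint is active, $g(x) = 0$, giving $\lambda = 1$ and $x_1 = x_2 = \tfrac{1}{2}$. All KKT conditions hold: $g(x) = 0 \ge 0$, $\lambda = 1 \ge 0$, $\lambda\, g(x) = 0$, and the constrained maximum is $f(\tfrac{1}{2}, \tfrac{1}{2}) = \tfrac{1}{2}$.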

Lagrangian for SVR The constraints for SVR,
$$\xi_n \ge 0, \qquad \hat{\xi}_n \ge 0, \qquad t_n \le y(x_n) + \epsilon + \xi_n, \qquad t_n \ge y(x_n) - \epsilon - \hat{\xi}_n,$$
lead to the Lagrangian
$$L = C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} (\mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n) - \sum_{n=1}^{N} a_n \bigl(\epsilon + \xi_n + y(x_n) - t_n\bigr) - \sum_{n=1}^{N} \hat{a}_n \bigl(\epsilon + \hat{\xi}_n - y(x_n) + t_n\bigr),$$
where $a_n \ge 0$, $\hat{a}_n \ge 0$, $\mu_n \ge 0$, $\hat{\mu}_n \ge 0$ are Lagrangian multipliers.

Dual problem Setting the derivatives of $L$ with respect to $w$, $b$, $\xi_n$ and $\hat{\xi}_n$ to zero, one obtains after elimination of these variables the Lagrangian $\tilde{L}$ in the dual formulation, with the KKT conditions
$$\xi_n \ge 0, \qquad \hat{\xi}_n \ge 0, \qquad 0 \le a_n \le C, \qquad 0 \le \hat{a}_n \le C,$$
$$a_n \bigl(\epsilon + \xi_n + y(x_n) - t_n\bigr) = 0, \qquad \hat{a}_n \bigl(\epsilon + \hat{\xi}_n - y(x_n) + t_n\bigr) = 0,$$
$$(C - a_n)\,\xi_n = 0, \qquad (C - \hat{a}_n)\,\hat{\xi}_n = 0 \qquad \text{(eliminating } \mu_n, \hat{\mu}_n).$$

Support vectors for SVR Solving for $w$ we see that new predictions can be made using
$$y(x) = \sum_{n=1}^{N} (a_n - \hat{a}_n)\, k(x, x_n) + b.$$
From the KKT conditions,
$$a_n \bigl(\epsilon + \xi_n + y(x_n) - t_n\bigr) = 0, \qquad \hat{a}_n \bigl(\epsilon + \hat{\xi}_n - y(x_n) + t_n\bigr) = 0,$$
we obtain the support vectors: 1. $a_n$ / $\hat{a}_n$ is only nonzero for points above / below the tube (or on its boundary). 2. The two bracketed terms cannot both vanish, therefore either $a_n = 0$ or $\hat{a}_n = 0$.
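As a usage sketch of the resulting sparse predictor (scikit-learn's SVR is used here only as an illustrative implementation; the toy data and parameter values are my assumptions), where epsilon corresponds to the width of the $\epsilon$-tube and C to the slack penalty:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 2 * np.pi, 100))[:, None]
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
reg.fit(X, t)

# y(x) = sum_n (a_n - a_hat_n) k(x, x_n) + b: only points on or outside the
# epsilon-tube receive nonzero coefficients, so the expansion is sparse.
print("support vectors:", len(reg.support_), "of", len(X))
print("dual coefficients (a_n - a_hat_n):", reg.dual_coef_.shape)   # (1, n_SV)
y_new = reg.predict(np.linspace(0, 2 * np.pi, 5)[:, None])
```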