Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1




2 Reading
- Chapter 6 of Pattern Recognition and Machine Learning by Bishop
- Chapter 12 of The Elements of Statistical Learning by Hastie, Tibshirani, Friedman


3 Motivating Example: XOR
- For classification, we prefer a linear decision boundary: $w^T x$
- That is not always possible: see XOR-type classes
- Solution: go to a higher dimension (feature space $\phi(x)$), i.e. use models of the form $w^T \phi(x)$
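As a rough illustration of the XOR case, the sketch below assumes a 0/1 encoding of the inputs and a hand-picked feature map $\phi(x) = (x_1, x_2, x_1 x_2)$. The weights `w` and bias `b` are chosen by hand for this example and are not taken from the lecture.

```python
import numpy as np

# The four XOR inputs and their class labels (0/1 encoding is an assumption).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0])

def phi(x):
    """Hypothetical feature map: append the product x1*x2 as a third feature."""
    return np.array([x[0], x[1], x[0] * x[1]])

# Hand-picked linear decision function in feature space: w^T phi(x) + b.
w = np.array([1.0, 1.0, -2.0])
b = -0.5

# Classify as 1 when the linear function is positive, else 0.
pred = np.array([int(w @ phi(x) + b > 0) for x in X])
print(pred, "matches labels:", bool(np.all(pred == t)))
```

No straight line in the original two inputs separates these labels, but a plane in the three-dimensional feature space does.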

4 Non-linear Mappings
- In the lectures on linear models for regression and classification, we looked at models with $w^T \phi(x)$
- The feature space $\phi(x)$ could be high-dimensional
- This was good because if data aren't separable in the original input space ($x$), they may be in feature space $\phi(x)$
- We'd like to avoid computing high-dimensional $\phi(x)$
- We'd like to work with $x$ that doesn't have a natural vector-space representation, e.g. graphs, sets, strings
- N items are always (linearly!) separable in N dimensions

5 Kernel Trick
- In previous lectures on linear models, we would explicitly compute $\phi(x_i)$ for each datapoint and run the algorithm in feature space
- For some feature spaces, we can compute the dot product $\phi(x_i)^T \phi(x_j)$ efficiently
- The efficient method is to compute a kernel function $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
- The kernel trick is to rewrite an algorithm so that $x$ only enters in the form of dot products
- The menu:
  - Kernel trick examples
  - Kernel functions

6 A Kernel Trick
- Let's look at the nearest-neighbour classification algorithm
- For input point $x_i$, find the point $x_j$ with the smallest distance:
  $\|x_i - x_j\|^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i - 2 x_i^T x_j + x_j^T x_j$
- If we used a non-linear feature space $\phi(\cdot)$:
  $\|\phi(x_i) - \phi(x_j)\|^2 = \phi(x_i)^T \phi(x_i) - 2 \phi(x_i)^T \phi(x_j) + \phi(x_j)^T \phi(x_j) = k(x_i, x_i) - 2 k(x_i, x_j) + k(x_j, x_j)$
- So nearest-neighbour can be done in a high-dimensional feature space without actually moving to it
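The distance identity above can be sketched in a few lines. The function names (`kernel_sq_dist`, `kernel_nearest_neighbour`) and the toy data are illustrative assumptions; the polynomial kernel is just one possible choice of $k$.

```python
import numpy as np

def poly_kernel(x, z):
    """Example kernel k(x, z) = (1 + x^T z)^2; any valid kernel could be used."""
    return (1.0 + x @ z) ** 2

def kernel_sq_dist(k, xi, xj):
    """Squared feature-space distance, using only kernel evaluations."""
    return k(xi, xi) - 2.0 * k(xi, xj) + k(xj, xj)

def kernel_nearest_neighbour(k, query, X):
    """Index of the row of X closest to `query` in the feature space induced by k."""
    dists = [kernel_sq_dist(k, query, x) for x in X]
    return int(np.argmin(dists))

# Toy data (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
query = np.array([0.9, 1.2])
nearest = kernel_nearest_neighbour(poly_kernel, query, X)
print("nearest index:", nearest)
```

With the plain dot product as the kernel, `kernel_sq_dist` reduces to the ordinary squared Euclidean distance.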

7 A Kernel Function
- Consider the kernel function $k(x, z) = (1 + x^T z)^2$
- With $x, z \in \mathbb{R}^2$:
  $k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2$
  $= 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$
  $= (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2)(1, \sqrt{2} z_1, \sqrt{2} z_2, z_1^2, \sqrt{2} z_1 z_2, z_2^2)^T$
  $= \phi(x)^T \phi(z)$
- So this particular kernel function does correspond to a dot product in a feature space (it is valid)
- Computing $k(x, z)$ is faster than explicitly computing $\phi(x)^T \phi(z)$
- With higher input dimension or a larger exponent, the saving is much greater
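The expansion above can be checked numerically. This is a minimal sketch: the random test points are an assumption, and `phi` is the six-dimensional feature map read off from the expansion, including the $\sqrt{2}$ factors.

```python
import numpy as np

def k(x, z):
    """Kernel k(x, z) = (1 + x^T z)^2 for 2-D inputs."""
    return (1.0 + x @ z) ** 2

def phi(x):
    """Explicit feature map for the kernel above (2-D inputs only)."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], x[0] ** 2, s * x[0] * x[1], x[1] ** 2])

# Two arbitrary test points (random, for illustration only).
rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = k(x, z)            # one dot product in 2-D, then a square
rhs = phi(x) @ phi(z)    # explicit dot product in 6-D
print(np.isclose(lhs, rhs))
```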

8 Why Kernels?
- Why bother with kernels?
- It is often easier to specify how similar two things are (dot product) than to construct an explicit feature space $\phi$
- There are high-dimensional (even infinite) spaces that have efficient-to-compute kernels
- So you want to use kernels
- We need to know when a kernel function is valid, so we can apply the kernel trick

9 Valid Kernels
- Given some arbitrary function $k(x_i, x_j)$, how do we know if it corresponds to a dot product in some space?
- Valid kernels: if $k(\cdot, \cdot)$ satisfies:
  - Symmetric: $k(x_i, x_j) = k(x_j, x_i)$
  - Positive definite: for any $x_1, \ldots, x_N$, the Gram matrix $K$ must be positive semi-definite:
    $K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{pmatrix}$
  - Positive semi-definite means $x^T K x \geq 0$ for all $x$
- then $k(\cdot, \cdot)$ corresponds to a dot product in some space $\phi$
- a.k.a. Mercer kernel, admissible kernel, reproducing kernel
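The two conditions can be checked numerically on a sample of points. This is only a sketch: passing on one sample does not prove a kernel is valid, while a clearly negative eigenvalue does show it is not. The helper names, the tolerance `tol`, and the counter-example `neg_sq_dist` are assumptions made for illustration.

```python
import numpy as np

def gram_matrix(k, X):
    """Gram matrix K with K[i, j] = k(X[i], X[j])."""
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

def looks_valid(K, tol=1e-8):
    """True if K is symmetric and its smallest eigenvalue is >= -tol."""
    symmetric = np.allclose(K, K.T)
    # eigvalsh expects a symmetric matrix, so symmetrize before calling it.
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)
    return bool(symmetric and eigvals.min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # random sample points (illustrative)

gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)  # Gaussian kernel, sigma = 1
neg_sq_dist = lambda a, b: -np.sum((a - b) ** 2)          # symmetric, but not a valid kernel

ok_gauss = looks_valid(gram_matrix(gauss, X))
ok_neg = looks_valid(gram_matrix(neg_sq_dist, X))
print(ok_gauss, ok_neg)
```

The counter-example fails because its Gram matrix has a zero diagonal and non-zero off-diagonal entries, so its eigenvalues sum to zero and at least one must be negative.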

10 Examples of Kernels
- Some kernels:
  - Linear kernel: $k(x_1, x_2) = x_1^T x_2$
    - $\phi(x) = x$
  - Polynomial kernel: $k(x_1, x_2) = (1 + x_1^T x_2)^d$
    - Contains all polynomial terms up to degree $d$
  - Gaussian kernel: $k(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / 2\sigma^2)$
    - Infinite-dimensional feature space
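A minimal sketch of the three kernels listed above. The function names and the default values for `d` and `sigma` are assumptions for illustration.

```python
import numpy as np

def linear_kernel(x1, x2):
    """k(x1, x2) = x1^T x2"""
    return float(x1 @ x2)

def polynomial_kernel(x1, x2, d=2):
    """k(x1, x2) = (1 + x1^T x2)^d"""
    return float((1.0 + x1 @ x2) ** d)

def gaussian_kernel(x1, x2, sigma=1.0):
    """k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))"""
    return float(np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2)))

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
print(linear_kernel(a, b), polynomial_kernel(a, b, d=2), gaussian_kernel(a, b))
```

The Gaussian kernel always returns 1 for identical inputs and decays towards 0 as the points move apart, with `sigma` setting the length scale.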

11 Constructing Kernels
- Can build new valid kernels from existing valid ones:
  - $k(x_1, x_2) = c\,k_1(x_1, x_2)$, $c > 0$
  - $k(x_1, x_2) = k_1(x_1, x_2) + k_2(x_1, x_2)$
  - $k(x_1, x_2) = k_1(x_1, x_2)\,k_2(x_1, x_2)$
  - $k(x_1, x_2) = \exp(k_1(x_1, x_2))$
- Table on p. 296 gives many such rules
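These closure rules can be sanity-checked on Gram matrices: scaling, adding, multiplying element-wise, or exponentiating element-wise should leave the matrix positive semi-definite. The sketch below uses a linear kernel as $k_1$ and a Gaussian kernel as $k_2$ on random points; the data, the tolerance, and the names are assumptions, and a numerical check on one sample is not a proof.

```python
import numpy as np

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(1)
X = 0.5 * rng.normal(size=(5, 3))  # random sample points (illustrative)

# Pairwise squared distances, then two base Gram matrices.
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K1 = X @ X.T            # linear kernel
K2 = np.exp(-sq / 2.0)  # Gaussian kernel, sigma = 1

candidates = {
    "scaled": 3.0 * K1,    # c * k1 with c > 0
    "sum": K1 + K2,        # k1 + k2
    "product": K1 * K2,    # k1 * k2 (element-wise on the Gram matrix)
    "exp": np.exp(K1),     # exp(k1) (element-wise on the Gram matrix)
}

# Allow a small negative tolerance for floating-point round-off.
results = {name: bool(min_eig(K) >= -1e-8) for name, K in candidates.items()}
print(results)
```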

12 More Kernels
- Stationary kernels are only a function of the difference between arguments: $k(x_1, x_2) = k(x_1 - x_2)$
  - Translation invariant in input space: $k(x_1, x_2) = k(x_1 + c, x_2 + c)$
- Homogeneous kernels, a.k.a. radial basis functions, are only a function of the magnitude of the difference: $k(x_1, x_2) = k(\|x_1 - x_2\|)$
- Set subsets: $k(A_1, A_2) = 2^{|A_1 \cap A_2|}$, where $|A|$ denotes the number of elements in $A$
- Domain-specific: think hard about your problem, figure out what it means to be similar, define as $k(\cdot, \cdot)$, prove positive definite (Feynman algorithm)
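The set kernel works on inputs that have no natural vector representation. One way to see that it is a dot product, sketched below, is to take $\phi(A)$ as an indicator vector over all subsets of a universe: the dot product then counts the subsets shared by $A_1$ and $A_2$, which is $2^{|A_1 \cap A_2|}$. The function names and example sets are assumptions for illustration.

```python
from itertools import combinations

def subset_kernel(a, b):
    """k(A1, A2) = 2^{|A1 ∩ A2|}, computed directly from the intersection size."""
    return 2 ** len(a & b)

def all_subsets(s):
    """All subsets of s, including the empty set, as frozensets."""
    items = sorted(s)
    return {frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)}

def explicit_dot(a, b):
    """Dot product of subset-indicator features: the number of subsets common to a and b."""
    return len(all_subsets(a) & all_subsets(b))

A1 = {1, 2, 3}
A2 = {2, 3, 4, 5}
print(subset_kernel(A1, A2), explicit_dot(A1, A2))
```

The explicit version enumerates exponentially many subsets, while the kernel needs only one set intersection.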

13 Regression - Kernelized
- Regularized least squares regression can also be kernelized
- Kernelized solution is $y(x) = k(x)^T (K + \lambda I_N)^{-1} t$
- vs. $y(x) = \phi(x)^T (\Phi^T \Phi + \lambda I_M)^{-1} \Phi^T t$ for the original version
- $N$ is the number of datapoints (size of Gram matrix $K$)
- $M$ is the number of basis functions (size of matrix $\Phi^T \Phi$)
- The kernelized version inverts an $N \times N$ matrix instead of an $M \times M$ one: bad if $N > M$, but good otherwise
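The two solutions give the same prediction when the kernel is the dot product of the same features, i.e. $K = \Phi \Phi^T$ and $k(x) = \Phi \phi(x)$. A minimal numerical sketch, with random data and a value of `lam` chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 8, 3, 0.1          # datapoints, basis functions, regularizer (illustrative)
Phi = rng.normal(size=(N, M))  # design matrix, row n is phi(x_n)^T
t = rng.normal(size=N)         # targets
phi_x = rng.normal(size=M)     # features of a new input x

# Original version: solve an M x M system.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
y_primal = phi_x @ w

# Kernelized version: solve an N x N system, using only dot products of features.
K = Phi @ Phi.T                # Gram matrix
k_x = Phi @ phi_x              # vector with entries k(x_n, x)
y_dual = k_x @ np.linalg.solve(K + lam * np.eye(N), t)

print(np.isclose(y_primal, y_dual))
```

Here $N > M$, so the kernelized route solves the larger system; it pays off when the feature space is much larger than the dataset.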

14 Conclusion
- Readings: Ch (pp )
- Many algorithms can be re-written with only dot products of features
  - We've seen NN, perceptron, regression; also PCA, SVMs (later)
- Non-linear features, or domain-specific similarity measurements, are useful
  - Dot products of non-linear features, or similarity measurements, can be written as kernel functions
  - Validity by positive semi-definiteness of the kernel function
- Can have an algorithm work in a non-linear feature space without actually mapping inputs to feature space
  - Advantageous when the feature space is high-dimensional


(Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning