Outline. Motivation. Mapping the input space to the feature space Calculating the dot product in the feature space


1 to The The A s s in to Fabio A. González Ph.D. Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 2, 2009

Outline

1. Motivation
2. The Kernel Trick
   - Mapping the input space to the feature space
   - Calculating the dot product in the feature space
3. The Kernel Approach
4. A Kernel Pattern Analysis Algorithm
   - Primal linear regression
   - Dual linear regression
5. Kernel Functions
   - Mathematical characterisation
   - Visualizing kernels in input space
6. Kernel Algorithms
7. Kernels in Complex Structured Data

Motivation

Problem 1: How to separate these two classes using a linear function?

Motivation

Problem 2: How to do symbolic regression?

$\Sigma = \{A, C, G, T\}$, $f : \Sigma^d \to \mathbb{R}$

String   f(string)
ACGTA    10.0
GTCCA    11.3
GGTAC    1.0
CCTGA    ?

Problem 1

How to separate these two classes using a linear function?

Solution

Map to $\mathbb{R}^3$:

$$\phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x, y) \mapsto (x^2, y^2, xy)$$


Input space vs. feature space

Dot product in the feature space

$$\phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$$

$$\begin{aligned}
\langle \phi(x), \phi(z) \rangle &= \langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2), (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) \rangle \\
&= x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 \\
&= (x_1 z_1 + x_2 z_2)^2 \\
&= \langle x, z \rangle^2
\end{aligned}$$

A function $k : X \times X \to \mathbb{R}$ such that $k(x, z) = \langle \phi(x), \phi(z) \rangle$ is called a kernel.

Moral: you don't need to apply $\phi$ explicitly to calculate the dot product in the feature space!
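The identity above can be checked numerically. The following is a minimal sketch in plain Python; the function names (`phi`, `dot`, `quadratic_kernel`) and the two sample points are illustrative choices, not part of the original slides.

```python
import math

def phi(x):
    """Explicit feature map R^2 -> R^3: (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, z):
    """k(x, z) = <x, z>^2, computed in input space only."""
    return dot(x, z) ** 2

# Arbitrary sample points (assumed for illustration).
x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))     # dot product computed in feature space
implicit = quadratic_kernel(x, z)  # same quantity without ever applying phi

print(explicit, implicit)
```

Both values agree up to floating-point rounding, although the second one never builds the three-dimensional vectors.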

Kernel-induced feature space

The feature space induced by the kernel is not unique. The kernel $k(x, z) = \langle x, z \rangle^2$ also calculates the dot product in the four-dimensional feature space:

$$\phi : \mathbb{R}^2 \to \mathbb{R}^4, \qquad (x_1, x_2) \mapsto (x_1^2, x_2^2, x_1 x_2, x_2 x_1)$$

The example can be generalised to $\mathbb{R}^n$.
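One way to write out the generalisation to $\mathbb{R}^n$ is to index the features by all ordered pairs $(i, j)$. The index notation below is an assumption of this sketch; the slides only state that the generalisation exists.

```latex
\langle x, z \rangle^2
  = \Big( \sum_{i=1}^{n} x_i z_i \Big)^2
  = \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i x_j)(z_i z_j)
  = \langle \phi(x), \phi(z) \rangle,
\qquad
\phi(x) = (x_i x_j)_{i,j=1}^{n} \in \mathbb{R}^{n^2}
```

For $n = 2$ this is exactly the four-dimensional map $(x_1^2, x_2^2, x_1 x_2, x_2 x_1)$ up to the ordering of coordinates.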

The Process

The Approach

- Data items are embedded into a vector space called the feature space.
- Linear relations are sought among the images of the data items in the feature space.
- The pattern analysis algorithms are based only on the pairwise dot products; they do not need the actual coordinates of the embedded points.
- The pairwise dot products in the feature space can be calculated efficiently using a kernel function.

Problem definition

Given a training set $S = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ of points $x_i \in \mathbb{R}^n$ with corresponding labels $y_i \in \mathbb{R}$, the problem is to find a real-valued linear function that best interpolates the training set:

$$g(x) = \langle w, x \rangle = w^\top x = \sum_{i=1}^{n} w_i x_i$$

If the data points were generated by a function like $g(x)$, it is possible to find the parameters $w$ by solving

$$Xw = y, \qquad \text{where } X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_l^\top \end{pmatrix}.$$

Graphical representation

Loss function

Minimize

$$L(g, S) = L(w, S) = \sum_{i=1}^{l} (y_i - g(x_i))^2 = \sum_{i=1}^{l} \xi_i^2 = \sum_{i=1}^{l} L(g, (x_i, y_i))$$

This could be written as

$$L(w, S) = \|\xi\|^2 = (y - Xw)^\top (y - Xw),$$

where $\xi = y - Xw$ is the vector of residuals.

Solution

$$\frac{\partial L(w, S)}{\partial w} = -2 X^\top y + 2 X^\top X w = 0,$$

therefore

$$X^\top X w = X^\top y,$$

and

$$w = (X^\top X)^{-1} X^\top y.$$
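As a small illustration of the primal solution, the sketch below solves the normal equations with NumPy. The toy data set and the generating weights `w_true` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical toy data: 4 points in R^2 generated by g(x) = 2*x1 - 1*x2.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
w_true = np.array([2.0, -1.0])
y = X @ w_true

# Solve the normal equations X'X w = X'y.
# np.linalg.solve avoids forming the inverse (X'X)^{-1} explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)

print(w)
```

Because the labels were generated without noise, the recovered `w` matches the generating weights up to rounding.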

Dual representation of the problem

$$w = (X^\top X)^{-1} X^\top y = X^\top X (X^\top X)^{-2} X^\top y = X^\top \alpha$$

So $w$ is a linear combination of the training samples, $w = \sum_{i=1}^{l} \alpha_i x_i$.

Solution

From the solution of the primal problem:

$$X^\top X w = X^\top y,$$

then

$$X X^\top X w = X X^\top y,$$

using the dual representation $w = X^\top \alpha$,

$$X X^\top X X^\top \alpha = X X^\top y,$$

then

$$\alpha = (X X^\top)^{-1} y,$$

and

$$g(x) = w^\top x = \alpha^\top X x.$$

Note: $X X^\top$ may be singular, or so close to singular that it cannot be inverted reliably at machine precision.
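The dual solution can be tried on a tiny case where $X X^\top$ is invertible. The sketch below uses two hypothetical training points in $\mathbb{R}^3$ (so $l < n$ and the rows of $X$ are linearly independent); the data and the test point `x_new` are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: l = 2 training points in R^3, one per row of X.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

G = X @ X.T                    # l x l matrix of pairwise dot products
alpha = np.linalg.solve(G, y)  # alpha = (XX')^{-1} y
w = X.T @ alpha                # w as a linear combination of the samples

x_new = np.array([1.0, 1.0, 1.0])
g_dual = alpha @ (X @ x_new)   # g(x) = alpha' X x, uses only dot products
g_primal = w @ x_new           # g(x) = w' x

print(alpha, w, g_dual, g_primal)
```

Note that the prediction `g_dual` touches the data only through dot products between `x_new` and the training points, which is what later allows a kernel to be substituted.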

Ridge regression

If $X X^\top$ is singular, the pseudo-inverse could be used to find the $w$ that satisfies $X^\top X w = X^\top y$ with minimal norm.

Optimisation problem:

$$\min_{w} L_\lambda(w, S) = \min_{w} \; \lambda \|w\|^2 + \sum_{i=1}^{l} (y_i - g(x_i))^2,$$

where $\lambda$ defines the trade-off between norm and loss. This controls the complexity of the model (the process is called regularization).

Solution

Taking the derivative and setting it equal to zero:

$$X^\top X w + \lambda w = (X^\top X + \lambda I_n) w = X^\top y,$$

where $I_n$ is the $n \times n$ identity matrix. Then

$$w = (X^\top X + \lambda I_n)^{-1} X^\top y.$$

In terms of $\alpha$:

$$w = \lambda^{-1} X^\top (y - Xw) = X^\top \alpha,$$

then

$$\alpha = \lambda^{-1} (y - Xw) = (X X^\top + \lambda I_l)^{-1} y.$$
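The primal and dual ridge solutions should give the same weight vector. The following sketch checks this on random data; the data, the seed, and the value `lam = 0.1` are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # l = 5 samples, n = 3 features (assumed sizes)
y = rng.normal(size=5)
lam = 0.1                     # regularisation parameter lambda (assumed value)

l, n = X.shape

# Primal: w = (X'X + lambda I_n)^{-1} X'y, an n x n system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual: alpha = (XX' + lambda I_l)^{-1} y, an l x l system, then w = X' alpha.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(l), y)
w_dual = X.T @ alpha

print(w_primal)
print(w_dual)
```

The primal form solves an $n \times n$ system and the dual form an $l \times l$ system, so which one is cheaper depends on whether there are more features or more samples.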

Prediction function

$$g(x) = \langle w, x \rangle = \Big\langle \sum_{i=1}^{l} \alpha_i x_i, x \Big\rangle = \sum_{i=1}^{l} \alpha_i \langle x_i, x \rangle$$

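Ridge regression as a kernel method

The Gram matrix $G = X X^\top$ is the matrix of pairwise dot products:

$$G = X X^\top = \begin{pmatrix} x_1^\top \\ \vdots \\ x_l^\top \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_l \end{pmatrix} = \begin{pmatrix} \langle x_1, x_1 \rangle & \cdots & \langle x_1, x_l \rangle \\ \vdots & \ddots & \vdots \\ \langle x_l, x_1 \rangle & \cdots & \langle x_l, x_l \rangle \end{pmatrix}$$

$G$ may be replaced by a general kernel matrix $K$, with $k_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.

The $\alpha$'s are calculated as:

$$\alpha = (K + \lambda I_l)^{-1} y$$

The prediction function is then:

$$g(x) = \sum_{i=1}^{l} \alpha_i k(x, x_i) = y^\top (K + \lambda I_l)^{-1} \begin{pmatrix} k(x, x_1) \\ \vdots \\ k(x, x_l) \end{pmatrix}$$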
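A minimal sketch of kernel ridge regression following the two formulas above. The helper names (`gaussian_kernel`, `fit`, `predict`), the choice of a Gaussian kernel, the sine-shaped toy data, and the value of `lam` are all assumptions for illustration, not the author's implementation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Matrix of k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-sq / (2.0 * sigma**2))

def fit(X, y, lam, kernel):
    """Dual coefficients alpha = (K + lambda I_l)^{-1} y."""
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, kernel):
    """g(x) = sum_i alpha_i k(x, x_i) for every row x of X_new."""
    return kernel(X_new, X_train) @ alpha

# Hypothetical one-dimensional regression problem.
X = np.linspace(0.0, 3.0, 20).reshape(-1, 1)
y = np.sin(2.0 * X[:, 0])
lam = 1e-3

alpha = fit(X, y, lam, gaussian_kernel)
y_hat = predict(X, alpha, X, gaussian_kernel)

print(np.max(np.abs(y - y_hat)))
```

Swapping `gaussian_kernel` for a function that returns `A @ B.T` recovers ordinary dual ridge regression, since the kernel matrix is then the Gram matrix $X X^\top$.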

Characterisation

Theorem (Mercer's Theorem). A function $k : X \times X \to \mathbb{R}$, which is either continuous or has a countable domain, can be decomposed as

$$k(x, z) = \langle \phi(x), \phi(z) \rangle,$$

that is, into a feature map $\phi$ into a Hilbert space $F$ applied to both its arguments, followed by the evaluation of the inner product in $F$, if and only if it satisfies the finitely positive semi-definite property ($k$ is symmetric and every kernel matrix built from a finite set of points is positive semi-definite).

Some kernel functions

Assume $k_1$ and $k_2$ are kernels. Then the following are also kernels:

- $k(x, z) = p(k_1(x, z))$, where $p$ is a polynomial with positive coefficients.
- $k(x, z) = \exp(k_1(x, z))$.
- $k(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))$, the Gaussian kernel.
- $k(x, z) = k_1(x, z)\, k_2(x, z)$.
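These closure properties can be probed empirically by building kernel matrices on a sample and checking that they are positive semi-definite. The sketch below does this with NumPy; the random sample, the particular polynomial, the tolerance, and the helper `is_psd` are assumptions for illustration, and a numerical check on one sample is not a proof.

```python
import numpy as np

def is_psd(K, tol=1e-8):
    """True if the symmetric matrix K has no eigenvalue below -tol (relative)."""
    eig = np.linalg.eigvalsh(K)
    return bool(eig.min() >= -tol * max(1.0, np.abs(eig).max()))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))  # hypothetical sample of 6 points in R^2
sigma = 1.0

# Base kernel k1(x, z) = <x, z>: its kernel matrix is the Gram matrix.
K1 = X @ X.T

# Squared Euclidean distances ||x_i - x_j||^2 from the Gram matrix.
d = np.diag(K1)
sq_dists = d[:, None] + d[None, :] - 2.0 * K1

# Element-wise operations on K1 apply the function to each kernel value.
candidates = {
    "polynomial p(t) = 1 + 2t + t^2": 1.0 + 2.0 * K1 + K1**2,
    "exp(k1)": np.exp(K1),
    "gaussian": np.exp(-sq_dists / (2.0 * sigma**2)),
    "product k1 * exp(k1)": K1 * np.exp(K1),
}

results = {name: is_psd(K) for name, K in candidates.items()}
for name, ok in results.items():
    print(name, ok)
```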

Embeddings corresponding to kernels

- It is possible to calculate the feature space induced by a kernel (Mercer's Theorem).
- This can be done in a constructive way.
- The feature space can even be of infinite dimension.

How to visualize?

- Choose a point $p_0$ in input space.
- Calculate the distance from another point $x$ to $p_0$ in the feature space:

$$\begin{aligned}
\|\phi(p_0) - \phi(x)\|_F^2 &= \langle \phi(p_0) - \phi(x), \phi(p_0) - \phi(x) \rangle_F \\
&= \langle \phi(p_0), \phi(p_0) \rangle_F + \langle \phi(x), \phi(x) \rangle_F - 2 \langle \phi(p_0), \phi(x) \rangle_F \\
&= k(p_0, p_0) + k(x, x) - 2 k(p_0, x)
\end{aligned}$$

- Plot $f(x) = \|\phi(p_0) - \phi(x)\|_F^2$.
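The feature-space distance needs only three kernel evaluations. The sketch below computes it for three kernels in plain Python; the function names, the reference point `p0`, and the evaluation point `x` are assumptions for illustration. Evaluating `feature_distance_sq` over a grid of `x` values would give the surface to plot.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, z):
    return dot(x, z)

def quadratic_kernel(x, z):
    return dot(x, z) ** 2

def gaussian_kernel(x, z, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def feature_distance_sq(k, p0, x):
    """||phi(p0) - phi(x)||^2 = k(p0, p0) + k(x, x) - 2 k(p0, x)."""
    return k(p0, p0) + k(x, x) - 2.0 * k(p0, x)

p0 = (1.0, 0.0)  # reference point in input space (assumed)
x = (0.0, 1.0)   # point whose feature-space distance to p0 is measured

d_lin = feature_distance_sq(linear_kernel, p0, x)
d_quad = feature_distance_sq(quadratic_kernel, p0, x)
d_gauss = feature_distance_sq(gaussian_kernel, p0, x)

print(d_lin, d_quad, d_gauss)
```

For the linear kernel the result is just the squared Euclidean distance in input space, which is a quick sanity check on the formula.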

Identity kernel

$$k(x, z) = \langle x, z \rangle$$

Quadratic kernel (1)

$$k(x, z) = \langle x, z \rangle^2$$

Quadratic kernel (2)

$$k(x, z) = \langle x, z \rangle^2$$

Gaussian kernel

$$k(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}$$

Basic computations in feature space

- Means
- Distances
- Projections
- Covariance

Classification and regression

- Support Vector Machines
- Support Vector Regression
- Fisher Discriminant
- Perceptron

Dimensionality reduction and clustering

- PCA
- CCA
- k-means
- SOM

Kernels in complex structured data

- Since kernel methods do not require an attribute-based representation of objects, it is possible to perform learning over complex structured data (or unstructured data).
- We only need to define a dot product operation (a similarity or dissimilarity measure).
- Examples:
  - Strings
  - Texts
  - Trees
  - Graphs

Problem 2

How to do symbolic regression?

$\Sigma = \{A, C, G, T\}$, $f : \Sigma^d \to \mathbb{R}$

String   f(string)
ACGTA    10.0
GTCCA    11.3
GGTAC    1.0
CCTGA    ?

Solution

- Define a kernel on strings, $k : \Sigma^d \times \Sigma^d \to \mathbb{R}$.
- Use the kernel along with a kernel-based regression algorithm to find the regression function.
- What is a good candidate for $k$? A function that measures string similarity: a higher value for similar strings, a smaller value for different strings.

$$k(s_1 \ldots s_d, t_1 \ldots t_d) = \sum_{i=1}^{d} \mathrm{equal}(s_i, t_i), \qquad \mathrm{equal}(s_i, t_i) = \begin{cases} 1 & \text{if } s_i = t_i \\ 0 & \text{otherwise} \end{cases}$$

$k(\text{ACTAG}, \text{CCTCG}) = \,?$
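The position-matching kernel is short enough to write directly. The sketch below is one possible plain-Python version; the function name `string_kernel` and the length check are choices made for this example.

```python
def string_kernel(s, t):
    """Count the positions at which two equal-length strings agree."""
    if len(s) != len(t):
        raise ValueError("both strings must have the same length d")
    return sum(1 for a, b in zip(s, t) if a == b)

# The example from the slide: the strings agree at positions 2, 3 and 5.
value = string_kernel("ACTAG", "CCTCG")
print(value)
```

A string compared with itself scores $d$, the maximum possible value, and two strings with no position in common score 0.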

Induced Feature Space

What is the feature space induced by $k$?

$$\phi : \Sigma^d \to \mathbb{R}^{4d}, \qquad s_1 \ldots s_d \mapsto (x_1^1, \ldots, x_4^1, x_1^2, \ldots, x_4^2, \ldots, x_1^d, \ldots, x_4^d)$$

$$(x_1^j, \ldots, x_4^j) = \begin{cases} (1, 0, 0, 0) & \text{if } s_j = A \\ (0, 1, 0, 0) & \text{if } s_j = C \\ (0, 0, 1, 0) & \text{if } s_j = G \\ (0, 0, 0, 1) & \text{if } s_j = T \end{cases}$$
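The embedding is a per-position one-hot encoding, and its dot product reproduces the position-matching count. The sketch below checks this in plain Python; the names `ALPHABET`, `phi`, and `dot` are illustrative assumptions.

```python
ALPHABET = "ACGT"  # symbol order fixes which of the 4 coordinates is set

def phi(s):
    """Map a string of length d to a 0/1 vector of length 4d (one-hot per position)."""
    vec = []
    for ch in s:
        block = [0, 0, 0, 0]
        block[ALPHABET.index(ch)] = 1
        vec.extend(block)
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

s, t = "ACTAG", "CCTCG"
embedded_dot = dot(phi(s), phi(t))               # dot product in R^{4d}
matches = sum(1 for a, b in zip(s, t) if a == b)  # direct position-by-position count

print(len(phi(s)), embedded_dot, matches)
```

Each position contributes 1 to the dot product exactly when the two strings carry the same symbol there, so the two quantities coincide.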

References

Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.
