Kernel Methods. Konstantin Tretyakov (kt@ut.ee). MTAT.03.227 Machine Learning

1 Kernel Methods Konstantin Tretyakov (kt@ut.ee) MTAT.03.227 Machine Learning

2 So far Supervised machine learning Linear models Non-linear models Unsupervised machine learning Generic scaffolding

3 So far Supervised machine learning Linear models Least squares regression, SVR Fisher's discriminant, Perceptron, Logistic model, SVM Non-linear models Neural networks, Decision trees, Association rules Unsupervised machine learning Clustering/EM, PCA Generic scaffolding Probabilistic modeling, ML/MAP estimation Performance evaluation, Statistical learning theory Linear algebra, Optimization methods

4 Coming up next Supervised machine learning Linear models Least squares regression, SVR Fisher's discriminant, Perceptron, Logistic model, SVM Non-linear models Neural networks, Decision trees, Association rules Kernel-XXX Unsupervised machine learning Clustering/EM, PCA, Kernel-XXX Generic scaffolding Probabilistic modeling, ML/MAP estimation Performance evaluation, Statistical learning theory Linear algebra, Optimization methods Kernels

5 Logistic regression, Perceptron, Max. margin, Fisher's discriminant, Linear regression, Ridge Regression, LASSO, …: f(x) =

6 Logistic regression, Perceptron, Max. margin, Fisher's discriminant, Linear regression, Ridge Regression, LASSO, …: f(x) = wᵀx + b

7 Logistic regression, Perceptron, Max. margin, Fisher's discriminant, Linear regression, Ridge Regression, LASSO, …: f(x) = wᵀx + b PCA, LDA, ICA, …: f(x) =

8 Logistic regression, Perceptron, Max. margin, Fisher's discriminant, Linear regression, Ridge Regression, LASSO, …: f(x) = wᵀx + b PCA, LDA, ICA, …: f(x) = Ax

9 Logistic regression, Perceptron, Max. margin, Fisher's discriminant, Linear regression, Ridge Regression, LASSO, …: f(x) = wᵀx + b PCA, LDA, ICA, …: f(x) = Ax K-means: c_i = (1/m) X_i 1 CCA, GLM, …

10 Too much linear Logistic regression, Perceptron, Max. margin, Fisher's discriminant, Linear regression, Ridge Regression, LASSO, …: f(x) = wᵀx + b PCA, LDA, ICA, …: f(x) = Ax K-means: c_i = (1/m) X_i 1 CCA, GLM, …

11 Linear is not enough Limited generalization ability

12 Linear is not enough Limited generalization ability

13 Linear is not enough Limited applicability Text? Ordinal/Nominal data? Graphs/Trees/Networks? Shapes? Graph nodes?

14 Solutions

15 Solutions Feature space Kernels

16 Solutions Feature space: Nonlinear feature spaces (Important idea #1) Kernels: The Kernel Trick (Important idea #2) Dual representation (Important idea #3)

17 f(x) = wx

18 x ↦ φ(x) = (x, x², x³)

19 Nonlinear feature space x ↦ φ(x) = (x, x², x³) f(x) = w_1 x + w_2 x² + w_3 x³
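A minimal sketch of this idea in Python, assuming NumPy and made-up toy data: mapping 1-D inputs through φ(x) = (x, x², x³) lets plain linear least squares capture a cubic trend.

```python
import numpy as np

# Hypothetical toy data with a cubic trend that no straight line can fit.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = x**3 - x + 0.1 * rng.standard_normal(50)

# Explicit feature map phi(x) = (x, x^2, x^3); the model stays linear in w.
Phi = np.column_stack([x, x**2, x**3])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

f = Phi @ w  # f(x) = w1*x + w2*x^2 + w3*x^3
print(w)     # roughly (-1, 0, 1): the generating coefficients
```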

20

21 x ↦ φ(x) = (x, x³ − x)

22 x ↦ φ(x) = (x, x³ − x)

23 x ↦ φ(x) = (x, x³ − x)

24 Nonlinear feature space f(x) = wᵀφ(x)

25 + Support for arbitrary data types: φ(text) = word counts, φ(graph) = node degrees, φ(tree) = path lengths
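To illustrate φ(text) = word counts, here is a minimal bag-of-words sketch; the helper names phi_text and k_text are made up for this example.

```python
from collections import Counter

def phi_text(text):
    # Feature map phi(text) = word counts, stored sparsely as a Counter.
    return Counter(text.lower().split())

def k_text(a, b):
    # Kernel between two texts: inner product of their count vectors.
    ca, cb = phi_text(a), phi_text(b)
    return sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())

print(k_text("the kernel trick", "kernel methods use the kernel trick"))  # 4
```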

26 What if the dimensionality is high? (x_1, x_2, …, x_m) ↦ (x_1 x_1, x_1 x_2, …, x_m x_m)

27 What if the dimensionality is high? (x_1, x_2, …, x_m) ↦ (x_1 x_1, x_1 x_2, …, x_m x_m) O(m²) elements For all k-wise products: O(m^k)
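A quick back-of-the-envelope check of the blow-up (counting distinct k-wise products with repeats allowed and order ignored; exact counts depend on the convention, but the growth is O(m^k) either way):

```python
from math import comb

m = 100
for k in (2, 3, 4):
    # Distinct k-wise products of m coordinates: C(m + k - 1, k) ~ m^k / k!.
    print(k, comb(m + k - 1, k))
# 2 5050
# 3 171700
# 4 4421275
```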

28 The Kernel Trick Let φ(x) = (x_1 x_1, x_1 x_2, …, x_m x_m) Consider ⟨φ(x), φ(y)⟩ = Σ_{ij} φ(x)_{ij} φ(y)_{ij}

29 The Kernel Trick Let φ(x) = (x_1 x_1, x_1 x_2, …, x_m x_m) Consider ⟨φ(x), φ(y)⟩ = Σ_{ij} φ(x)_{ij} φ(y)_{ij} = Σ_{ij} x_i x_j y_i y_j

30 The Kernel Trick Let φ(x) = (x_1 x_1, x_1 x_2, …, x_m x_m) Consider ⟨φ(x), φ(y)⟩ = Σ_{ij} φ(x)_{ij} φ(y)_{ij} = Σ_{ij} x_i x_j y_i y_j = Σ_{ij} (x_i y_i)(x_j y_j)

31 The Kernel Trick Let φ(x) = (x_1 x_1, x_1 x_2, …, x_m x_m) Consider ⟨φ(x), φ(y)⟩ = Σ_{ij} φ(x)_{ij} φ(y)_{ij} = Σ_{ij} x_i x_j y_i y_j = Σ_{ij} (x_i y_i)(x_j y_j) = (Σ_i x_i y_i)(Σ_j x_j y_j)

32 The Kernel Trick Let φ(x) = (x_1 x_1, x_1 x_2, …, x_m x_m) Consider ⟨φ(x), φ(y)⟩ = Σ_{ij} φ(x)_{ij} φ(y)_{ij} = Σ_{ij} x_i x_j y_i y_j = Σ_{ij} (x_i y_i)(x_j y_j) = (Σ_i x_i y_i)(Σ_j x_j y_j) = (Σ_i x_i y_i)²

33 The Kernel Trick Let φ(x) = (x_1 x_1, x_1 x_2, …, x_m x_m) Consider ⟨φ(x), φ(y)⟩ = Σ_{ij} φ(x)_{ij} φ(y)_{ij} = (Σ_i x_i y_i)² = ⟨x, y⟩²

34 The Kernel Trick Let φ(x) = (x_1 x_1, x_1 x_2, …, x_m x_m) Consider ⟨φ(x), φ(y)⟩ = Σ_{ij} φ(x)_{ij} φ(y)_{ij} = (Σ_i x_i y_i)² = ⟨x, y⟩²

35 The Kernel Trick Let φ(x) = (x_1 x_1, x_1 x_2, …, x_m x_m) Consider ⟨φ(x), φ(y)⟩ = Σ_{ij} φ(x)_{ij} φ(y)_{ij} = (Σ_i x_i y_i)² = ⟨x, y⟩² Polynomial kernel: K(x, y) = (⟨x, y⟩ + R)^d
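A minimal numeric check of the trick, assuming NumPy: the O(m²)-dimensional feature map and the O(m) kernel evaluation agree exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)

# Explicit quadratic feature map: phi(x)_{ij} = x_i * x_j for all pairs (i, j).
phi = lambda v: np.outer(v, v).ravel()

lhs = phi(x) @ phi(y)        # inner product in the m^2-dimensional space
rhs = (x @ y) ** 2           # kernel trick: <x, y>^2, computed in O(m)
print(np.isclose(lhs, rhs))  # True

# Polynomial kernel from the slide: K(x, y) = (<x, y> + R)^d
def poly_kernel(x, y, R=1.0, d=2):
    return (x @ y + R) ** d
```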

36 The Kernel Trick What about: K(x, y) = ⟨x, y⟩ + ⟨x, y⟩²?

37 The Kernel Trick What about: K(x, y) = ⟨x, y⟩ + ⟨x, y⟩² = Σ_i x_i y_i + Σ_{ij} φ(x)_{ij} φ(y)_{ij}

38 The Kernel Trick What about: K(x, y) = ⟨x, y⟩ + ⟨x, y⟩² = Σ_i x_i y_i + Σ_{ij} φ(x)_{ij} φ(y)_{ij} = ⟨(x_1, …, x_m, x_1 x_1, …, x_m x_m), (y_1, …, y_m, y_1 y_1, …, y_m y_m)⟩

39 The Kernel Trick What about: K(x, y) = ⟨x, y⟩ + ⟨x, y⟩² = Σ_i x_i y_i + Σ_{ij} φ(x)_{ij} φ(y)_{ij} = ⟨(x_1, …, x_m, x_1 x_1, …, x_m x_m), (y_1, …, y_m, y_1 y_1, …, y_m y_m)⟩

40 The Kernel Trick What about: K(x, y) = 1 + ⟨x, y⟩ + ⟨x, y⟩² + ⟨x, y⟩³ + ⟨x, y⟩⁴?

41 The Kernel Trick What about: K(x, y) = Σ_{i=0}^∞ ⟨x, y⟩^i / i!

42 The Kernel Trick What about: K(x, y) = Σ_{i=0}^∞ ⟨x, y⟩^i / i! = exp(⟨x, y⟩)

43 The Kernel Trick What about: K(x, y) = Σ_{i=0}^∞ ⟨x, y⟩^i / i! = exp(⟨x, y⟩) Infinite-dimensional feature space!

44 The Kernel Trick K(x, y) = Σ_{i=0}^∞ ⟨x, y⟩^i / i! = exp(⟨x, y⟩) Gaussian kernel: K(x, y) = exp(−γ‖x − y‖²) = exp(−‖x − y‖² / 2σ²) Infinite-dimensional feature space!

45 The Kernel Trick K(x, y) = Σ_{i=0}^∞ ⟨x, y⟩^i / i! = exp(⟨x, y⟩) Exponential kernel: K(x, y) = exp(−‖x − y‖ / 2σ²) Infinite-dimensional feature space!
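A small sketch, assuming NumPy, of these two kernels as pairwise matrix computations over rows of data matrices:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all row pairs of X and Y.
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * sigma**2))

def exponential_kernel(X, Y, sigma=1.0):
    # K(x, y) = exp(-||x - y|| / (2 sigma^2)): the norm itself, not its square.
    dist = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    return np.exp(-dist / (2 * sigma**2))

X = np.random.default_rng(2).standard_normal((4, 3))
K = gaussian_kernel(X, X)
print(K.shape, np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))  # (4, 4) True True
```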

46 Kernels

47 Structured data kernels String kernels P-spectrum kernels All-subsequences kernels Gap-weighted subsequences kernels Graph & tree kernels Co-rooted subtrees All subtrees Random walks

48 Kernel A function K(x, y) is a kernel if K(x, y) = ⟨φ(x), φ(y)⟩ for some feature map φ.

49 Kernel matrix For a given kernel function K and a finite dataset (x_1, x_2, …, x_n), the n×n matrix K_{ij} = K(x_i, x_j) is called the kernel matrix.

50 Kernel matrix Let X be the data matrix. Then K = XXᵀ is the kernel matrix for the linear kernel K(x, y) = xᵀy.

51 Kernel matrix Let X be the data matrix. Then K = XXᵀ is the kernel matrix for the linear kernel K(x, y) = xᵀy. Let φ be a feature mapping. Then* K = φ(X)φ(X)ᵀ is the kernel matrix for the corresponding kernel K(x, y) = ⟨φ(x), φ(y)⟩.
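A minimal check of both claims, assuming NumPy and reusing the quadratic feature map from the kernel-trick slides:

```python
import numpy as np

X = np.random.default_rng(3).standard_normal((5, 2))

# Linear kernel: K = X X^T.
K_lin = X @ X.T

# With a feature map, K = phi(X) phi(X)^T; rows of phi_X are phi(x_i).
phi_X = np.stack([np.outer(x, x).ravel() for x in X])
K_quad = phi_X @ phi_X.T

print(np.allclose(K_quad, K_lin ** 2))  # True: matches K(x, y) = <x, y>^2
```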

52 Kernel theorem Not every function K is a kernel! Example?

53 Kernel theorem Not every function K is a kernel! e.g. K(x, y) = −1 is not Not every n×n matrix is a kernel matrix!

54 Kernel theorem Theorem: K is a kernel function ⟺ K is symmetric positive semidefinite A function is positive semidefinite iff for any finite dataset {x_1, x_2, …, x_n} the corresponding kernel matrix is positive semidefinite.
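The theorem gives a practical test. A sketch via eigenvalues, assuming NumPy (is_psd is a made-up helper name):

```python
import numpy as np

def is_psd(K, tol=1e-10):
    # Kernel matrix test: symmetric and all eigenvalues >= 0 (up to tolerance).
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

X = np.random.default_rng(4).standard_normal((6, 3))
print(is_psd(X @ X.T))           # True: every Gram matrix is PSD
print(is_psd(-np.ones((6, 6))))  # False: K(x, y) = -1 admits no feature map
```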

55 Kernel closure

56 Kernel closure Feature space concatenation: K(x, y) = K_1(x, y) + K_2(x, y)

57 Kernel closure Feature space scaling: K(x, y) = c K_1(x, y), c > 0

58 Kernel closure Feature space tensor product: K(x, y) = K_1(x, y) K_2(x, y)

59 Kernel closure Feature map composition: K(x, y) = K_1(ψ(x), ψ(y))

60 Kernel normalization Let φ̃(x) = φ(x) / ‖φ(x)‖ Then K̃(x, y) = ⟨φ̃(x), φ̃(y)⟩ = ⟨φ(x), φ(y)⟩ / (‖φ(x)‖ ‖φ(y)‖) = K(x, y) / √(K(x, x) K(y, y))

61 Kernel matrix normalization Then K̃(x, y) = ⟨φ̃(x), φ̃(y)⟩ = ⟨φ(x), φ(y)⟩ / (‖φ(x)‖ ‖φ(y)‖) = K(x, y) / √(K(x, x) K(y, y)) K_{ij} ← K_{ij} / √(K_{ii} K_{jj})
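As a sketch, assuming NumPy, the matrix form of this normalization:

```python
import numpy as np

def normalize_kernel(K):
    # K_ij <- K_ij / sqrt(K_ii * K_jj); afterwards diag(K) == 1,
    # i.e. every phi(x) is rescaled to unit length in feature space.
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

X = np.random.default_rng(5).standard_normal((4, 3))
Kn = normalize_kernel(X @ X.T)
print(np.allclose(np.diag(Kn), 1.0))  # True
```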

62 Kernel matrix centering x_i ← x_i − (1/n) Σ_k x_k

63 Kernel matrix centering x_i ← x_i − (1/n) Σ_k x_k

64 Kernel matrix centering x_i ← x_i − (1/n) Σ_k x_k X ← X − (1/n) 1_n 1_nᵀ X

65 Kernel matrix centering x_i ← x_i − (1/n) Σ_k x_k X ← X − (1/n) 1_n 1_nᵀ X XXᵀ ← (X − (1/n) 11ᵀ X)(X − (1/n) 11ᵀ X)ᵀ

66 Kernel matrix centering x_i ← x_i − (1/n) Σ_k x_k X ← X − (1/n) 1_n 1_nᵀ X XXᵀ ← (X − (1/n) 11ᵀ X)(X − (1/n) 11ᵀ X)ᵀ XXᵀ ← XXᵀ − (1/n) 11ᵀ XXᵀ − (1/n) XXᵀ 11ᵀ + (1/n²) 11ᵀ XXᵀ 11ᵀ

67 Kernel matrix centering K_cent = XXᵀ − (1/n) 11ᵀ XXᵀ − (1/n) XXᵀ 11ᵀ + (1/n²) 11ᵀ XXᵀ 11ᵀ = K − (1/n) 11ᵀ K − (1/n) K 11ᵀ + (1/n²) 11ᵀ K 11ᵀ
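A sketch of centering done purely on K, assuming NumPy, checked against centering the data explicitly:

```python
import numpy as np

def center_kernel(K):
    # K_cent = K - (1/n) 11^T K - (1/n) K 11^T + (1/n^2) 11^T K 11^T
    n = K.shape[0]
    J = np.ones((n, n)) / n  # J = (1/n) 11^T
    return K - J @ K - K @ J + J @ K @ J

X = np.random.default_rng(6).standard_normal((7, 3))
Xc = X - X.mean(axis=0)  # explicit centering in input space
print(np.allclose(center_kernel(X @ X.T), Xc @ Xc.T))  # True
```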

68 The Dual Representation Let A be the input space, and let B be the higher-dimensional feature space. Let φ: A → B be the feature map. Fix a dataset {x_1, x_2, …, x_n} ⊆ A Let w = Σ_i α_i φ(x_i) ∈ B We say that α_i are the dual coordinates for w.

69 Dual coordinates w = Σ_i α_i φ(x_i) = φ(X)ᵀ α = Ξᵀ α Note that ΞΞᵀ = φ(X) φ(X)ᵀ = K Now we can do all of the useful stuff using dual coordinates only.

70 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w =

71 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w = Ξᵀ(2α)

72 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w = Ξᵀ(2α) w + u =

73 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w = Ξᵀ(2α) w + u = Ξᵀ(α + β)

74 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w = Ξᵀ(2α) w + u = Ξᵀ(α + β) ⟨w, u⟩ =

75 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w = Ξᵀ(2α) w + u = Ξᵀ(α + β) ⟨w, u⟩ = wᵀu = αᵀΞΞᵀβ = αᵀKβ

76 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w = Ξᵀ(2α) w + u = Ξᵀ(α + β) ⟨w, u⟩ = wᵀu = αᵀΞΞᵀβ = αᵀKβ ‖w − u‖² =

77 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w = Ξᵀ(2α) w + u = Ξᵀ(α + β) ⟨w, u⟩ = wᵀu = αᵀΞΞᵀβ = αᵀKβ ‖w − u‖² = wᵀw + uᵀu − 2wᵀu =

78 Dual coordinates Let w = Ξᵀα, u = Ξᵀβ Then 2w = Ξᵀ(2α) w + u = Ξᵀ(α + β) ⟨w, u⟩ = wᵀu = αᵀΞΞᵀβ = αᵀKβ ‖w − u‖² = wᵀw + uᵀu − 2wᵀu = αᵀKα + βᵀKβ − 2αᵀKβ
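A numeric sketch, assuming NumPy (Xi plays the role of Ξ, with rows φ(x_i)), confirming that inner products and distances need only K and the dual coordinates:

```python
import numpy as np

rng = np.random.default_rng(7)
Xi = rng.standard_normal((8, 4))  # row i is phi(x_i)
K = Xi @ Xi.T                     # kernel matrix K = Xi Xi^T

alpha, beta = rng.standard_normal(8), rng.standard_normal(8)
w, u = Xi.T @ alpha, Xi.T @ beta  # primal vectors w = Xi^T alpha, u = Xi^T beta

print(np.isclose(w @ u, alpha @ K @ beta))  # <w, u> = alpha^T K beta
print(np.isclose(np.sum((w - u) ** 2),
                 alpha @ K @ alpha + beta @ K @ beta
                 - 2 * alpha @ K @ beta))   # ||w - u||^2 from K alone
```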

79 Kernelization Recall the Perceptron:

80 Kernelization Recall the Perceptron: Initialize w ← 0 Find a misclassified example (x_i, y_i) Update weights: w ← w + μ y_i x_i b ← b + μ y_i

81 Kernelization Recall the Perceptron: Initialize w ← 0, α ← 0 Find a misclassified example (x_i, y_i) Update weights: w ← w + μ y_i x_i b ← b + μ y_i

82 Kernelization Recall the Perceptron: Initialize w ← 0, α ← 0 Find a misclassified example (x_i, y_i) Update weights: w ← w + μ y_i x_i α_i ← α_i + μ y_i b ← b + μ y_i

83 Kernelization Recall the Perceptron: Initialize α ← 0 Find a misclassified example (x_i, y_i) Update weights: α_i ← α_i + μ y_i b ← b + μ y_i

84 Kernelization Recall the Perceptron: Initialize α ← 0 Find a misclassified example (x_i, y_i): sign(wᵀx_i + b) ≠ y_i ⟺ sign(Σ_j α_j x_jᵀ x_i + b) ≠ y_i Update weights: α_i ← α_i + μ y_i b ← b + μ y_i

85 Kernelization Recall the Perceptron: Initialize α ← 0 Find a misclassified example (x_i, y_i): sign(wᵀx_i + b) ≠ y_i ⟺ sign(K_i α + b) ≠ y_i Update weights: α_i ← α_i + μ y_i b ← b + μ y_i

86 Kernelization Recall the Perceptron: Initialize α ← 0 Find a misclassified example (x_i, y_i): sign(K_i α + b) ≠ y_i Update weights: α_i ← α_i + μ y_i b ← b + μ y_i
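Putting the kernelized updates together, a minimal sketch assuming NumPy; the data, labels, and helper name are made up for illustration:

```python
import numpy as np

def kernel_perceptron(K, y, mu=1.0, epochs=100):
    # Dual perceptron: w = sum_i alpha_i phi(x_i) is never formed explicitly;
    # training touches the data only through the kernel matrix K.
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if np.sign(K[i] @ alpha + b) != y[i]:  # misclassified example
                alpha[i] += mu * y[i]              # alpha_i <- alpha_i + mu*y_i
                b += mu * y[i]                     # b <- b + mu*y_i
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

# Toy usage with a Gaussian kernel on 1-D inputs labeled by an interval:
x = np.linspace(-3, 3, 40)
y = np.where(np.abs(x) < 1, 1.0, -1.0)           # not linearly separable in x
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)  # Gaussian kernel, sigma = 1
alpha, b = kernel_perceptron(K, y)
pred = np.sign(K @ alpha + b)
print((pred == y).mean())  # training accuracy; 1.0 once it converges
```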

87 Quiz Today we heard three important ideas. Important idea #1: … Important idea #2: … Important idea #3: … Function/matrix K is a kernel function/matrix iff it is … Dual representation: w = …

88 Quiz Those algorithms have kernelized versions: …

89
