Similarity and kernels in machine learning

1 1/31 Similarity and kernels in machine learning Zalán Bodó Babeş Bolyai University, Cluj-Napoca/Kolozsvár Faculty of Mathematics and Computer Science MACS 2016 Eger, Hungary

2 2/31 Overview of the presentation: Machine learning; Similarity. Similarity in (machine) learning; Kernels (Kernel methods, Examples of general purpose kernels, Kernels and similarities, A sample/simple method: prototype learning, The representer theorem, Dimensionality, The kernelization period); Semi-supervised learning and kernels (Assumptions in SSL, Humans and SSL, Data-dependent kernels, Reweighting cluster kernels); A toy dataset

3 3/31 Machine learning. Arthur Samuel, 1959: "field of study that gives computers the ability to learn without being explicitly programmed". "[...] machine learning is now an independent and mature field that has moved beyond psychologically or neurally inspired algorithms towards providing foundations for a theory of learning that is rooted in statistics and functional analysis" [Jäkel et al., 2007]. Machine learning = supervised learning (classification, regression), unsupervised learning (clustering, density estimation), reinforcement learning, + semi-supervised learning (classification)

10 4/31 Example: content-based spam filtering. Figure: a spam and a ham message.

11 5/31 Similarity. Similarity in (machine) learning. Similarity is fundamental to learning. Shepard: in each individual there is an internal metric of similarity between possible situations [Shepard, 1987]. Generalization is based on similarity between situations/events/objects/... Learning = generalize... (a) supervised scenarios: ... from labeled to unlabeled data; (b) unsupervised scenarios: ... from familiar to novel data. "The fundamental challenge confronted by any system that is expected to generalize from familiar to unfamiliar stimuli is how to estimate similarity over stimuli in a principled and feasible manner." [Shahbazi et al., 2016]

13 6/31 Similarity of... sets, e.g. Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B|; sequences, e.g. edit (Levenshtein) distance-based similarity E(s, t) = 1 − edist(s, t) / max(|s|, |t|); vectors, e.g. cosine similarity (= normalized dot product) C(x, z) = x⊤z / (‖x‖ ‖z‖); complex objects, e.g. of two text segments extracted from a PDF file...
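The three measures above are small enough to state directly in code. Below is a minimal Python sketch of my own (standard library only; function names are mine, not from the talk):

```python
from math import sqrt

def jaccard(A, B):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)

def edit_distance(s, t):
    """Levenshtein distance computed with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(s, t):
    """E(s, t) = 1 - edist(s, t) / max(|s|, |t|)."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

def cosine(x, z):
    """Cosine similarity: the normalized dot product of two vectors."""
    dot = sum(a * b for a, b in zip(x, z))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in z)))

print(jaccard({1, 2, 3}, {2, 3, 4}))         # 0.5
print(edit_similarity("kitten", "sitting"))  # 1 - 3/7 ~ 0.571
print(cosine([1.0, 0.0], [1.0, 1.0]))        # ~ 0.707
```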

19 7/31 Machine learning; Similarity. Similarity in (machine) learning; Kernels; Semi-supervised learning and kernels; A toy dataset. MACS dinner

21 9/31 Kernels. Figure: the XOR problem: separate the o's from the x's. Marvin Minsky, Seymour Papert. Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, Mass., 1969: a single artificial neuron/perceptron (= linear classifier) cannot solve the problem. M. A. Aizerman, E. M. Braverman, L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, vol. 25, 1964: use kernels!

22 10/31 Figure: using the polynomial kernel (axes x₁, x₂ mapped to x₁², x₂², √2·x₁x₂). Map the points using the function φ(x) = [x₁², x₂², √2·x₁x₂]; this is equivalent to using k(x, z) = ⟨φ(x), φ(z)⟩ = (x⊤z)² (= polynomial kernel). Polynomial kernel: links the features using logical AND (the size of the group of linked features is determined by the order of the kernel).
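The claimed equivalence is easy to check numerically; a small NumPy sketch of my own (not code from the talk):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2D."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly2(x, z):
    """k(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, -2.0])
z = np.array([0.5, 3.0])
print(np.dot(phi(x), phi(z)))  # 30.25
print(poly2(x, z))             # 30.25 -- identical, as claimed
```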

23 11/31 Kernel methods. 1909: James Mercer: any continuous, symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space [Mercer, 1909]. 1964: Aizerman, Braverman and Rozonoer: first application [Aizerman et al., 1964]. 1992: Boser, Guyon and Vapnik: famous application (SVM) [Boser et al., 1992]. Linear algorithms → non-linear algorithms. Feature mapping: φ : X → H (φ : R^d₁ → R^d₂). Kernels: k(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)⊤φ(z); covers all geometric constructions that can be formulated in terms of angles, lengths and distances. Kernel trick: given an algorithm which is formulated in terms of a positive definite kernel k(·, ·), one can construct an alternative algorithm by replacing k(·, ·) by another positive definite kernel k̃(·, ·).

25 12/31 Examples of general purpose kernels: linear: k(x, z) = x⊤z; polynomial: k(x, z) = (a·x⊤z + b)^c; Gaussian (RBF): k(x, z) = exp(−γ‖x − z‖²)
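For concreteness, the three kernels written as Gram-matrix computations in NumPy (a sketch of my own; the parameter names a, b, c, gamma follow the formulas above):

```python
import numpy as np

def linear_kernel(X, Z):
    """k(x, z) = x.z for all pairs of rows of X and Z."""
    return X @ Z.T

def polynomial_kernel(X, Z, a=1.0, b=1.0, c=3):
    """k(x, z) = (a x.z + b)^c."""
    return (a * (X @ Z.T) + b) ** c

def gaussian_kernel(X, Z, gamma=0.5):
    """k(x, z) = exp(-gamma ||x - z||^2), via ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z."""
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2 * (X @ Z.T)
    return np.exp(-gamma * np.clip(sq, 0.0, None))

X = np.random.randn(5, 3)
print(linear_kernel(X, X).shape)         # (5, 5) Gram matrix
print(gaussian_kernel(X, X).diagonal())  # all ones: exp(0)
```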

28 13/31 Kernels and similarities. Kernel: real-valued, symmetric, positive definite. Similarity: real-valued, not necessarily symmetric, not necessarily p.d. k(x, z) = ½ [k(x, x) + k(z, z) − ‖φ(x) − φ(z)‖²]; sim(x, z) = inverse of the distance between x and z. k(x, z) = ⟨φ(x), φ(z)⟩ = the cosine similarity of the mapped vectors, provided they are normalized.
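The last statement corresponds to cosine-normalizing the Gram matrix; a one-function sketch of my own, assuming a precomputed kernel matrix K:

```python
import numpy as np

def cosine_normalize(K):
    """k~(x, z) = k(x, z) / sqrt(k(x, x) k(z, z)):
    the cosine of the angle between phi(x) and phi(z) in feature space."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```

After this normalization the diagonal is 1 and, by the Cauchy–Schwarz inequality, every entry lies in [−1, 1].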

30 14/31 A sample/simple method: prototype learning. Figure: a point x classified relative to the class centers c₊ and c₋ along the direction w. Class centers (centroids, prototypes): c₊ = (1/N₊) Σ_{x_i ∈ X₊} x_i, c₋ = (1/N₋) Σ_{x_i ∈ X₋} x_i

31 15/31 Define the following vectors: w = c₊ − c₋ and c = (c₊ + c₋)/2; then y(x) = sgn⟨x − c, w⟩ = sgn(⟨c₊, x⟩ − ⟨c₋, x⟩ + b), with b = (‖c₋‖² − ‖c₊‖²)/2. Using dot products between the x_i's: y(x) = sgn( (1/N₊) Σ_{x_i ∈ X₊} ⟨x, x_i⟩ − (1/N₋) Σ_{x_i ∈ X₋} ⟨x, x_i⟩ + b ), where b = (1/(2N₋²)) Σ_{x_i, x_j ∈ X₋} ⟨x_i, x_j⟩ − (1/(2N₊²)) Σ_{x_i, x_j ∈ X₊} ⟨x_i, x_j⟩
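Because the decision function uses the training points only through dot products, ⟨·, ·⟩ can be replaced by any kernel. A minimal NumPy sketch of the resulting kernel prototype classifier (my own rendering of the formulas above, not the speaker's code):

```python
import numpy as np

def prototype_predict(K_test, K_train, y):
    """Kernel prototype (centroid) classifier.

    K_train : (n, n) Gram matrix of the training points
    K_test  : (m, n) kernel values between test and training points
    y       : (n,) labels in {-1, +1}
    """
    pos, neg = (y == 1), (y == -1)
    Np, Nn = pos.sum(), neg.sum()
    # b = (||c_minus||^2 - ||c_plus||^2) / 2, expressed with kernel evaluations
    b = 0.5 * (K_train[np.ix_(neg, neg)].sum() / Nn ** 2
               - K_train[np.ix_(pos, pos)].sum() / Np ** 2)
    scores = K_test[:, pos].sum(1) / Np - K_test[:, neg].sum(1) / Nn + b
    return np.sign(scores)
```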

32 16/31 The representer theorem. Theorem (Schölkopf and Smola, 2002). Let H be the feature space associated to a positive semi-definite kernel k : X × X → R. Denote by Ω : [0, ∞) → R a strictly monotonically increasing function, and by c : (X × R²)^l → R ∪ {∞} an arbitrary loss function. Then each minimizer of the regularized risk c((x₁, y₁, f(x₁)), ..., (x_l, y_l, f(x_l))) + Ω(‖f‖_H) admits a representation of the form f(x) = Σ_{i=1}^{l} α_i k(x_i, x)

33 17/31 Semiparametric representer theorem: f(x) = Σ_{i=1}^{l} α_i k(x_i, x) + Σ_{p=1}^{M} β_p ψ_p(x). Loss function + regularization for the centroid classifier: −(1/N₊) Σ_{y_i = +1} y_i f(x_i) − (1/N₋) Σ_{y_i = −1} y_i f(x_i) + ½ ‖w‖₂², where f(x_i) = w⊤x_i + b

34 18/31 Dimensionality: curse or blessing? Usually φ : R^d₁ → R^d₂ with d₂ > d₁ or d₂ ≫ d₁. Why? The higher the dimensionality, the easier it is to find a separating hyperplane. The Vapnik–Chervonenkis (VC) dimension of a classification algorithm = the size of the largest set of points that the algorithm can shatter (shattering a set of points = all possible labelings of the points can be realized by the method). The VC dimension of oriented hyperplanes in R^d is d + 1 (see proof in [Burges, 1998]).

35 19/31 φ need not increase the dimensionality; it suffices to map the points to a better representational space. In either case, the Johnson–Lindenstrauss lemma applies [Johnson and Lindenstrauss, 1984]: if the number of data points is relatively small (compared to the dimensionality), a random projection to a logarithmically lower dimensionality approximately preserves the relative distances. Corollary: kernels can be used for dimensionality reduction.
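A rough NumPy illustration of the Johnson–Lindenstrauss statement (the target dimension formula and the constants below are only indicative choices of mine for the example):

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all pairs of rows of A (upper triangle)."""
    sq = (A ** 2).sum(1)
    D2 = sq[:, None] + sq[None, :] - 2 * (A @ A.T)
    return np.clip(D2, 0.0, None)[np.triu_indices(len(A), 1)]

rng = np.random.default_rng(0)
N, d, eps = 200, 10_000, 0.25
k = int(np.ceil(8 * np.log(N) / eps ** 2))    # indicative target dimension, k << d

X = rng.standard_normal((N, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)  # random Gaussian projection
Y = X @ R                                     # the same N points in k dimensions

ratios = np.sqrt(pairwise_sq_dists(Y) / pairwise_sq_dists(X))
print(k, ratios.min(), ratios.max())          # distance ratios stay close to 1
```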

36 20/31 The kernelization period (199x–200y). 1992: SVM; ?: kernel regularized least squares; 1996: kernel PCA; 1999: kernel Fisher discriminant analysis, transductive SVM; 2001: kernel k-means clustering, kernel canonical correlation analysis, SVC (support vector clustering); 2005: first data-dependent non-parametric kernel, Laplacian regularized least squares, Laplacian SVM; ...

38 21/31 (Some DBLP stats.) Figure: number of works retrieved for the keyword "kernel" on DBLP. Figure: top 10 authors for the same keyword: Bernhard Schölkopf (73), Johan A. K. Suykens (68), José Carlos Príncipe (63), Stefan Kratsch (60), Alessandro Moschitti (56), Alexander J. Smola (53), Hortensia Galeana-Sánchez (51), Arthur Gretton (47), Saket Saurabh (44), Edwin R. Hancock (44)

39 22/31 Semi-supervised learning and kernels. Semi-supervised learning (SSL). Supervised learning: D = {(x_i, y_i) | x_i ∈ X ⊆ R^d, y_i ∈ {−1, +1}, i = 1, ..., l}; find f : X → {−1, +1} which agrees with D. Semi-supervised learning: D = {(x_i, y_i) | i = 1, ..., l} ∪ {x_j | j = 1, ..., u}, l ≪ u, N = l + u; inductive: find f : X → {−1, +1} which agrees with D + use the information of D_U; transductive: find f : D_U → {−1, +1} by using D = D_L ∪ D_U

41 23/31 Assumptions in SSL 1. smoothness assumption: If two points x i and x j in a high density region are close, then so should be the corresponding outputs y i and y j. 2. cluster assumption: If two points are in the same cluster, they are likely to be of the same class. 3. manifold assumption (a.k.a. graph-based learning): The high dimensional data lie roughly on a low dimensional manifold.

44 24/31 Humans and SSL. Humans do semi-supervised classification too. 2007: experiment by Zhu and his colleagues, University of Wisconsin [Zhu et al., 2007]. Complex 3D shapes classified into two categories; participants were told they see microscopic images of pollen particles from two fictitious flowers (Belianthus and Nortulaca). Data given: 2 labeled examples (each appearing 10 times in 20 trials); a test set of 21 evenly spaced unlabeled examples to test the learned decision boundary; unlabeled examples whose means are shifted away from the labeled examples (left-shifted or right-shifted); a test set of 21 evenly spaced unlabeled examples to test whether the decision boundary has changed. The learned decision boundary is determined by both labeled and unlabeled data.

45 25/31 Data-dependent kernels. Supervised learning + data-dependent kernels = semi-supervised learning. Conventional kernels: given data sets D₁ ≠ D₂ and x, z ∈ D₁ ∩ D₂, k(x, z) = k(x, z). Data-dependent kernels: given data sets D₁ ≠ D₂ and x, z ∈ D₁ ∩ D₂, k(x, z; D₁) ≠ k(x, z; D₂), where ≠ reads as "not necessarily equal".

48 26/31 Reweighting cluster kernels. Idea borrowed from the bagged cluster kernel [Weston et al., 2005]: reweighting conventional kernels according to some clustering of the data [Bodó and Csató, 2010]. Kernel combinations: K₁ + K₂, aK, K₁ ∘ K₂. Cluster kernel: K = K_rw ∘ K_b, where K_b = base kernel (e.g. Gaussian, polynomial, etc.), K_rw = reweighting kernel, K = resulting cluster kernel used in the learning algorithm. Reweighting kernels: k_rw(x, z) = exp(−‖u_x − u_z‖² / (2σ²)); K_rw = U⊤U + α·11⊤, α ∈ [0, 1); K_rw = β·U⊤U + 11⊤, β ∈ (0, ∞); here U = matrix of cluster membership vectors (columns) of size K × N (no. of clusters × no. of points), and u_x denotes the cluster membership vector of x.
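A rough sketch of the construction in NumPy (my own reading of the formulas above; in particular, the elementwise combination of K_rw and K_b and the use of hard cluster assignments are assumptions here, not necessarily the authors' exact implementation):

```python
import numpy as np

def reweighting_cluster_kernel(K_base, labels, alpha=0.5):
    """Reweight a base Gram matrix by cluster co-membership.

    K_base : (N, N) base kernel matrix (e.g. Gaussian or polynomial)
    labels : (N,) hard cluster assignments, values in {0, ..., K-1}
    alpha  : weight kept by pairs in different clusters (alpha in [0, 1))
    """
    n_clusters = labels.max() + 1
    # U: K x N matrix whose columns are one-hot cluster membership vectors
    U = np.zeros((n_clusters, len(labels)))
    U[labels, np.arange(len(labels))] = 1.0
    # K_rw = U^T U + alpha * 1 1^T: 1 + alpha for same-cluster pairs, alpha otherwise
    K_rw = U.T @ U + alpha * np.ones_like(K_base)
    return K_rw * K_base  # elementwise reweighting of the base kernel
```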

51 27/31 A toy dataset. Figure: the linked tori dataset; labeled examples: 3 + 3, the remaining examples unlabeled.

52 28/31 Figure: linear SVM, accuracy = 70.81% (279/394). Figure: Gaussian SVM, accuracy = 69.54% (274/394)

53 29/31 Figure: SVM with the reweighting cluster kernel (RCK); clustering: fuzzy, p = 2, no. of clusters = 30; 3rd kernel, β = 1000; accuracy = 76.14% (300/394)

54 30/31 Thank you!

55 31/31 References

Aizerman et al., 1964: M. A. Aizerman, E. M. Braverman, L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, vol. 25, 1964.
Bodó and Csató, 2010: Z. Bodó, L. Csató. Hierarchical and Reweighting Cluster Kernels for Semi-Supervised Learning. Int. J. of Computers, Communications & Control, Vol. V, No. 4, 2010.
Boser et al., 1992: B. E. Boser, I. M. Guyon, V. N. Vapnik. A Training Algorithm for Optimal Margin Classifiers. COLT, 1992.
Burges, 1998: C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 1998.
Jäkel et al., 2007: F. Jäkel, B. Schölkopf, F. A. Wichmann. A Tutorial on Kernel Methods for Categorization. Journal of Mathematical Psychology, 51(6), 2007.
Johnson and Lindenstrauss, 1984: W. B. Johnson, J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 1984.
Mercer, 1909: J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, Series A, vol. 209, 1909.
Minsky and Papert, 1969: M. Minsky, S. Papert. Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, Mass., 1969.
Schölkopf and Smola, 2002: B. Schölkopf, A. J. Smola. Learning with Kernels. MIT Press, Cambridge, Mass., 2002.
Shahbazi et al., 2016: R. Shahbazi, R. Raizada, S. Edelman. Similarity, kernels, and the fundamental constraints on cognition. Journal of Mathematical Psychology, vol. 70, 2016.
Shepard, 1987: R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237, 1987.
Weston et al., 2005: J. Weston, C. Leslie, D. Zhou, A. Elisseeff, W. S. Noble. Semi-Supervised Protein Classification using Cluster Kernels. Bioinformatics, 21(15), 2005.
Zhu et al., 2007: X. Zhu, T. Rogers, R. Qian, C. Kalish. Humans perform semi-supervised classification too. AAAI, 2007.
