This is an author-deposited version published in : Eprints ID : 17710

Size: px

Start display at page:

Download "This is an author-deposited version published in : Eprints ID : 17710"

Shon Shelton
5 years ago
Views:

Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.

1 Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. This is an author-deposited version published in : Eprints ID : To link to this article : DOI : /eas/ URL : To cite this version : Fauvel, Mathieu Introduction to the Kernel Methods. (2016) EAS-journal, vol.77, pp ISSN Any correspondence concerning this service should be sent to the repository administrator: staff-oatao@listes-diff.inp-toulouse.fr

2 Introduction to Kernel Methods Classification of multivariate data Mathieu Fauvel < Tue>

3 Outline Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

4 Inner product An inner product is a map.,. X : X X K satisfying the following axioms: Symmetry: x, y X = y, x X Bilinearity: ax + by, cz + dw X = ac x, z X + ad x, w X + bc y, z X + bd y, w X Non-negativity: x, x X 0 Positive definiteness: x, x X = 0 x = 0 The standard inner product in the Euclidean space, x R d and d N, is called the dot product: x, y R n = n i=1 x iy i.

5 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

6 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

7 Toy data set Suppose we want to classify the following data S = {(x 1, y 1),..., (x n, y n)}, (x i, y i ) R 2 {±1}

8 Simple classifier: Decision rule: assigns a new sample x to the class whose mean is closer to x. f (x) = sgn ( µ 1 x 2 µ 1 x 2). (1) Equation (1) can be written in the following way f (x) = sgn ( w, x + b). Identify w and b. Tips: µ 1 x 2 = µ 1 x, µ 1 x and µ 1 = 1 m 1 i=1 m x i. 1

9 Solution 1/3 µ 1 x 2 = 1 m 1 m1 i=1 x i x, = 1 m1 1 x i, m 1 = 1 m 2 1 i=1 m 1 i=1 k=1 µ 1 x 2 = 1 m 1 m 1 2 m 1 1 m 1 m 1 i=1 m 1 i=1 x i x x i + x, x 2 1 m 1 m1 i=1 x i, x x i, x k + x, x 2 1 m1 x i, x (2) m 1 j=1 l=1 Plugging (2) and (3) into (1) leads to x j, x l + x, x 2 1 i=1 m 1 m 1 x j, x (3) j=1 w, x + b = 2( 1 m 1 m1 i=1 x i 1 m 1 m 1 j=1 x j ), x 1 m 2 1 m 1 i=1 k=1 x i, x k + 1 m 2 1 m 1 x j, x l j=1 l=1

10 Solution 2/3 w = 2 m 1 +m 1 i=1 y i = 1 or 1 α i = 1 m 1 or b = 1 m 2 1 α i y i x i (4) 1 m 1 m 1 i=1 k=1 x i, x k + 1 m 2 1 m 1 x j, x l (5) j=1 l=1

11 Solution 3/

12 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

13 Another toy data set Now the data are distributed a bit differently: However, if we can find a feature space where the data are linearly separable, it can still be applied. Questions: 1. Find a feature space where the data are linearly separable. 2. Try to write the dot product in the feature space in terms of input space variables.

14 Feature space Two simple feature spaces are possible: 1. Projection in the polar domain ρ = x x 2 2 θ = arctan ( x2 2. Projection in the space of monomials of order 2. φ : R 2 R 3 x φ(x) (x 1, x 2) (x 2 1, x 2 2, 2x 1x 2) x 1 )

15 Feature space associated to monomials of order 2 In R 3, the inner product can be expressed as 3 φ(x), φ(x ) R 3 = φ(x) i φ(x ) i i=1 = φ(x) 1φ(x ) 1 + φ(x) 2φ(x ) 2 + φ(x) 3φ(x ) 3 = x 2 1x x 2 2x x 1x 2x 1x 2 = (x 1x 1 + x 2x 2) 2 = x, x 2 R 2 = k(x, x ). The decision rule can be written in the input space thanks to the function k. f (x) = 2 m 1 +m 1 i=1 α i y i k(x i, x) 1 m 2 1 m 1 i=1 k=1 k(x i, x k ) + 1 m 1 m 1 2 j=1 l=1 k(x j, x l )

16 Non linear decision function Feature space (x 2 1, x 2 2) Separating function

17 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

18 Conclusion A linear algorithm can be turned to a non linear one, simply by exchanging the dot product by an appropriate function. This function has to be equivalent to a dot product in a feature space. It is called a kernel function or just kernel. What are the properties of kernel?

19 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

20 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

21 Definition Definition (Positive semi-definite kernel) k : R d R d R is positive semi-definite is (x, x ) R d R d, k(x i, x j ) = k(x j, x i ). n N, ξ 1... ξ n R, x 1... x n R d, n i,j ξ iξ j k(x i, x j ) 0. Theorem (Moore-Aronsjan (1950)) To every positive semi-definite kernel k, there exists a Hilbert space H and a feature map φ : R d H such that for all x i, x j we have k(x i, x j ) = φ(x i ), φ(x j ) H.

22 Operations on kernels Let k 1 and k 2 be positive semi-definite, and λ 1,2 > 0 then: 1. λ 1k 1 is a valid kernel 2. λ 1k 1 + λ 2k 2 is positive semi-definite. 3. k 1k 2 is positive semi-definite. 4. exp(k 1) is positive semi-definite. 5. g(x i )g(x j ) is positive semi-definite, with g : R d R.

23 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

24 Polynomial kernel The polynomial kernel of order p and bias q is defined as k(x i, x j ) = ( x i, x j + q) p ( ) p p = q p l x i, x j l. l l=1 It correspond to the feature space of monomials up to degree p. Depending on q 0, the relative weights of the higher order monomial is inscreased/deacreased.

25 Gaussian kernel The Gaussian kernel with paramater σ is defined as k(x i, x j ) = exp ( x ) i x j 2. 2σ 2 More generally, any distance can be used in the exponential rather than the Euclidean distance. For instance, the spectral angle is a valid distance: Θ(x i, x j ) = x i, x j x i x j.

26 Kernel values in R (a) (b) (a) Polynomial kernel values for p = 2 and q = 0 and x = [1, 1], (b) Gaussian kernel values for σ = 2 and x = [1, 1].

27 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

28 How to construct kernel for my data? A kernel is usually seen as a measure of similarity between two samples. It reflects in some sens, how two samples are similar. In practice, it is possible to define kernels using some a priori of our data. For instance: in image classification. It is possible to build kernels that includes information from the spatial domain. Local correlation Spatial position...

29 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

30 Compute distance in feature space 1/2 The K-NN decision rule is based on the distance between two samples. In the feature space, the distance is computed as φ(x i ) φ(x j ) 2 H. Write this equation in terms of kernel function. Fill the function R labwork_knn.r by adding the construction of the kernel function. Then run it.

31 Compute distance in feature space 2/2 φ(x i ) φ(x j ) 2 H = φ(x i ) φ(x j ), φ(x i ) φ(x j ) H = φ(x i ), φ(x i ) H + φ(x j ), φ(x j ) H 2 φ(x i ), φ(x j ) H = k(x i, x i ) + k(x j, x j ) 2k(x i, x j ) (a) (b) (a) KNN classification (b) Kernel KNN classification with a polynomial kernel of order 2

32 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

33 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

34 Learning problem Given a training set S and a loss function L, we want to find a function f from a set of functions F that minimizes its expected loss, or risk, R(f ): R(f ) = L(f (x), y)dp(x, y). (6) But P(x, y) is unknown! The empirical risk, R emp(f ) can be computed: S R emp(f ) = 1 n n L(f (x i ), y i ) (7) i=1 Convergence? f 1 minimizes R emp, then R emp(f 1 ) R(f 1 ) as n tends to infinity But f 1 is not necessarily a minimizer of R.

35 Non parametric classification Bayesian approach consists of selecting a distribution a priori for P(x, y) (GMM) In machine learning, no assumption is made as to the distribution, but only about the complexity of the class of functions F. Favor simple functions to discard over-fitting problems, to achieve a good generalization ability. Vapnik-Chervonenkis (VC) theory: the complexity is related to the number of points that can be separated by a function. R(f ) R emp(f, n) + C(f, n)

36 Illustration Trade-off between R emp and complexity VC dimension OK OK KO

37 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

38 Separating hyperplane A separating hyperplane H(w, b) is a linear decision function that separate the space into two half-spaces, each half-space corresponding to the given class, i.e., sgn ( w, x i + b) = y i for all samples from S. The condition of correct classification is Many hyperplanes? y i ( w, x i + b) 1. (8) x 2 x 1

39 Optimal separating hyperplane From the VC theory, the optimal one is the one that maximize the margin The margin is inversely proportional to w 2 = w, w. Optimal parameters are found by solving the convex optimization problem w, w minimize 2 subject to y i ( w, x i + b) 1, i 1,..., n. The problem is traditionally solved by considering soft margin constraints: y i ( w, x i + b) 1 + ξ i minimize w, w 2 + C n i=1 subject to y i ( w, x i + b) 1 ξ i, i 1,..., n ξ i ξ i 0, i 1,..., n.

40 Quadratic programming The previous problem is solved by considering the Lagrangian L(w, b, ξ, α, β) = w, w 2 + C n ξ i + i=1 n α i (1 ξ i y i ( w, x i + b)) i=1 Minimizing with respect to the primal variables and maximizing w.r.t the dual variables leads to the so-called dual problem: n max g(α) = α i 1 α 2 i=1 subject to 0 α i C n α i y i = 0. i=1 n α i α j y i y j x i, x j i,j=1 n β i ξ i w = n i=1 α iy i x i, only some of the α i are non zero. Thus w is supported by some training samples those with non-zero optimal α i. These are called the support vectors. i=1

41 Visual solution of the SVM x 2 w w x + b = 1 w x + b = 0 2 w w x + b = 1 x 1 b w

42 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

43 Kernelization of the algorithm It is possible to extend the linear SVM to non linear SVM by switching the dot product to a kernel function: n max g(α) = α i 1 α 2 i=1 subject to 0 α i C n α i y i = 0. i=1 n α i α j y i y j k(x i, x j ) Now, the SVM is a non-linear classifier in the input space R d, but is still linear in the feature space the space induced by the kernel function. The decision function is simply: ( n ) f (x) = sgn α i y i k(x, x i ) + b i=1 i,j=1

44 Toy example with the Gaussian kernel C = 1 C = 100

45 Comparison with GMM

46 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

47 Cross-validation Crucial step: improve or decrease drastically the performances of SVM Cross validation is conventionally used. CV estimates the expected error R. Test Train Exp. 1 R 1 emp Exp. 2 R 2 emp Exp. 3 R 3 emp Exp. k R k emp R(p) 1 k k i=1 Ri emp Good behavior in various supervised learning problem but high computational load. Test 10 values with k = 5 50 learning steps. But it can be perform in parallel...

48 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

49 Collection of binary classifiers One versus All: m binary classifiers One versus One: m(m-1)/2 classifiers

50 Introductory example Linear case Non linear case Kernel function Positive semi-definite kernels Some kernels Kernel K-NN Support Vectors Machines Learn from data Linear SVM Non linear SVM Fitting the hyperparameters Multiclass SVM Classification of hyperspectral data

51 Toy non linear data set Run the code toy_svm.r The default classification is done with a Gaussian kernel: try to do it with a polynomial kernel Check the influence of each hyperparameters for the Gaussian and Polynomial kernel For the Gaussian kernel and a given set of hyperparameters Get the number of support vectors Plot them Conclusions?

52 Simulated data Simulated reflectance of Mars surface (500 to 5200 nm) The model has 5 parameters (Sylvain Douté): the grain size of water and CO 2 ice, the proportion of water, CO 2 ice and dust. x R 184 and n = Fives classes according to the grain size of water In this labwork, we are going to use the R package e1071 that use the C++ library libsvm, the state of the art QP solver.

53 Questions Using the file main_svm.r, classify the data set with SVM with a Gaussian kernel, K-NN and Kernel KNN (with a polynomial kernel) Implement the cross-validation for SVM, to select the optimal hyperparameters (C,σ) Compute the confusion matrix for each methods and look at the classification accuracy

54 Load the data ## Load some library library("e1071") load("astrostat.rdata") n=nrow(x) d=ncol(x) C = max(y) numbertrain = 100 # Select "numbertrain" per class for training numbertest = 6300-numberTrain # The remaining is for validation ## Initialization of the training/validation sets xt = matrix(0,numbertrain*c,d) yt = matrix(0,numbertrain*c,1) xt = matrix(0,numbertest*c,d) yt = matrix(0,numbertest*c,1) for (i in 1:C) { t = which(y==i) ts = sample(t) # Permute ramdomly the samples for class i xt[(1+numbertrain*(i-1)):(numbertrain*i),]=x[ts[1:numbertrain],] yt[(1+numbertrain*(i-1)):(numbertrain*i),]=y[ts[1:numbertrain],] } xt[(1+numbertest*(i-1)):(numbertest*i),]=x[ts[(numbertrain+1):6300],] yt[(1+numbertest*(i-1)):(numbertest*i),]=y[ts[(numbertrain+1):6300],]

55 Perform a simple classification ## Learn the model model = svm(xt,yt,cost=1,gamma=0.001,type="c",cachesize=512) ## Predict the validation samples yp = predict(model,xt) ## Confusion matrix confu = table(yt,yp) OA = sum(diag(confu))/sum(confu)

Jeff Howbert Introduction to Machine Learning Winter

Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable