Kernel Methods & Support Vector Machines

Angela Hardy
5 years ago

1 Kernel Methods & Support Vector Machines. Mahdi Pakdaman Naeini, PhD Candidate, University of Tehran; Senior Researcher, TOSAN Intelligent Data Miners

2 Outline. Motivation; introduction to pattern recognition and machine learning; introduction to kernels; sparse kernel methods (SVM); anomaly detection using kernel methods

3 Motivation. Fraud detection perspectives: fast recall time of the learner; binary classification; one-class classification. Generalization performance of kernel methods: different kinds of information can be used; good performance in high-dimensional feature spaces. Linear learning typically has nice properties: unique optimal solutions, fast learning algorithms, and better statistical analysis.

4 Introduction. Data can exhibit regularities that may or may not be immediately apparent: exact patterns (e.g. motions of planets), complex patterns (e.g. genes in DNA), and probabilistic patterns (e.g. market research). Detecting patterns makes it possible to understand and/or exploit the regularities to make predictions. Machine learning and pattern recognition are the study of the automatic detection of patterns in data.

5 Pattern Defining. Generative models: parametric models (maximum entropy model, GMM, ...) and non-parametric models (histogram-based methods, KNN, Parzen estimate, ...). Discriminative models: linear and non-linear discriminant models (linear regression, neural networks, SVM).

6 Historical Perspective. Minsky and Papert highlighted the weakness of linear learners in their book Perceptrons. Neural networks overcame the problem by gluing together many linear units with non-linear activation functions. This solved the problem of capacity and led to a very impressive extension of the applicability of learning, but ran into training problems of speed and multiple local minima.

7 Kernel Methods Approach. The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space, reached through a map $x \mapsto \Phi(x)$. The expectation is that the feature space has a much higher dimension than the input space.

8 Form of the Functions. Kernel methods use linear functions in a feature space. For regression this could be the function $f(x) = \langle w, \Phi(x) \rangle + b$; for classification, the output additionally requires thresholding, e.g. $\operatorname{sign}(f(x))$.

9 Example. Consider the mapping $\Phi : \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. If we consider a linear equation in this feature space, e.g. $a z_1 + b z_3 = c$ with $a, b, c > 0$, we actually have an ellipse $a x_1^2 + b x_2^2 = c$, i.e. a non-linear shape in the input space.

10 Examples of Kernels (III). Polynomial kernel (n = 2); RBF kernel.

11 Capacity of Feature Spaces. The capacity is proportional to the dimension; for example, in 2 dimensions a linear classifier can realise every labelling of 3 points in general position.

12 Problems of High Dimensions. Computational costs involved in dealing with large vectors: addressed by kernel functions and the kernel trick. Capacity may easily become too large and lead to over-fitting, since being able to realise every classifier means being unlikely to generalise well: addressed by the large-margin trick.

13 Different Perspective: Learning & Similarity. Input/output sets $X$, $Y$. Training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times Y$. Generalization: given a previously unseen $x \in X$, find a suitable $y \in Y$ such that $(x, y)$ is similar to $(x_1, y_1), \ldots, (x_m, y_m)$. How to measure similarity? For outputs: a loss function (e.g. for $y \in \{-1, +1\}$, the zero-one loss). For inputs: a kernel function.

14 Similarity of Inputs. Symmetric function $k : X \times X \to \mathbb{R}$, $(x, x') \mapsto k(x, x')$. For example, if $X = \mathbb{R}^n$: the canonical dot product $k(x, x') = \sum_{i=1}^{n} [x]_i [x']_i$. If $X$ is not a dot product space: assume that $k$ has a representation as a dot product in a linear space $H$, i.e. there is a mapping $\Phi : X \to H$ such that $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. In that case, we can think of the patterns as $\Phi(x)$ and $\Phi(x')$, and carry out geometric algorithms in the dot product space (feature space) $H$.

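15 An Example of a Kernel Method. Idea: classify points in feature space according to which of the two class means is closer, where $c_+ = \frac{1}{m_+} \sum_{i : y_i = +1} \Phi(x_i)$ and $c_- = \frac{1}{m_-} \sum_{i : y_i = -1} \Phi(x_i)$. Compute the sign of the dot product between $w := c_+ - c_-$ and $\Phi(x) - c$, where $c := (c_+ + c_-)/2$ is the midpoint between the two means.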
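The class-mean rule can be written entirely in terms of kernel evaluations, because $\langle w, \Phi(x) - c \rangle$ expands into averages of $k(\cdot, \cdot)$. The following is a minimal sketch of that expansion, not the presenter's code; the function names `rbf` and `mean_classifier`, the choice of an RBF kernel, and the toy data are all assumptions made for illustration.

```python
import math

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel (an assumed choice; any valid kernel can be passed)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_classifier(pos, neg, kernel=rbf):
    """Build the 'closer class mean' decision function using kernels only."""
    def mean_k(xs, zs):
        # Average of k(x, z) over all pairs, i.e. <mean Phi(xs), mean Phi(zs)>.
        return sum(kernel(x, z) for x in xs for z in zs) / (len(xs) * len(zs))

    # Offset b = (||c_-||^2 - ||c_+||^2) / 2, from expanding <w, Phi(x) - c>.
    b = 0.5 * (mean_k(neg, neg) - mean_k(pos, pos))

    def decide(x):
        score = mean_k([x], pos) - mean_k([x], neg) + b
        return 1 if score >= 0 else -1

    return decide

# Hypothetical toy data: two well-separated clusters.
pos = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0)]
neg = [(-2.0, -2.0), (-3.0, -2.5), (-2.5, -3.0)]
decide = mean_classifier(pos, neg)
print(decide((2.2, 2.4)), decide((-2.1, -2.6)))
```

Passing a plain dot product as `kernel` recovers the ordinary nearest-mean classifier in the input space.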

16 An Example (cont'd). Provides a geometric interpretation of Parzen windows. The decision function is a hyperplane.

17 Example: All Degree-2 Monomials. $\Phi : \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.

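18 The Kernel Trick. The dot product in $H$ can be computed in $\mathbb{R}^2$: $\langle \Phi(x), \Phi(x') \rangle = x_1^2 (x_1')^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 (x_2')^2 = \langle x, x' \rangle^2 =: k(x, x')$.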
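A quick numerical check of this identity for the degree-2 monomial map of slide 17 can be sketched as follows. The helper names `phi`, `dot` and `k` and the two sample points are assumptions chosen for illustration.

```python
import math

def phi(x):
    """Explicit feature map R^2 -> R^3 of slide 17: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Canonical dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def k(x, y):
    """Degree-2 polynomial kernel, evaluated in the input space only."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(y))  # dot product computed in the feature space H
rhs = k(x, y)              # same quantity computed in R^2
print(lhs, rhs)
```

The two values agree up to floating-point rounding, while `k` never forms the three-dimensional vectors.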

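19 Kernel Trick (cont'd). More generally, for $x, x' \in \mathbb{R}^N$ and $d \in \mathbb{N}$: $\langle x, x' \rangle^d = \langle \Phi_d(x), \Phi_d(x') \rangle$, where $\Phi_d$ maps into the space spanned by all ordered products of $d$ input directions.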

20 Kernel: More Formal Definition. A kernel $k(x, y)$ is a similarity measure defined by an implicit mapping $\phi$ from the original space to a vector space (feature space) such that $k(x, y) = \phi(x) \cdot \phi(y)$. This similarity measure and the mapping include: invariance or other a priori knowledge; simpler structure (linear representation of the data); possibly infinite dimension (hypothesis space for learning) while remaining computationally efficient when computing $k(x, y)$; different kinds of data, e.g. strings, sets, graphs, trees, text.

21 Valid Kernels. The function $k(x, y)$ is a valid kernel if there exists a mapping $\phi$ into a vector space (with a dot product) such that $k$ can be expressed as $k(x, y) = \phi(x) \cdot \phi(y)$. Theorem: $k(x, y)$ is a valid kernel if $k$ is positive definite and symmetric (Mercer kernel). A function is positive definite if $\iint k(x, y)\, f(x)\, f(y)\, dx\, dy \ge 0$ for all $f \in L_2$. In other words, the Gram matrix $K$ (whose elements are $k(x_i, x_j)$) must be positive semi-definite for all choices of points $x_i, x_j$ from the input space.
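The Gram-matrix condition can be probed numerically on a finite sample. The sketch below is an illustration under assumptions (the helper names `gram` and `is_psd`, the tolerance, and the random sample are not from the slides). Passing the test on one sample is only a necessary condition, whereas a single failing sample is enough to rule a function out.

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix K with K[i, j] = kernel(X[i], X[j])."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def is_psd(K, tol=1e-9):
    """True if K is symmetric and its smallest eigenvalue is >= -tol."""
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

def rbf(a, b):
    # Gaussian RBF kernel, a known valid kernel.
    return np.exp(-0.5 * np.sum((a - b) ** 2))

def neg_sq_dist(a, b):
    # Symmetric, but not a valid kernel: its Gram matrix has trace zero.
    return -np.sum((a - b) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # hypothetical sample of six points in R^2
print(is_psd(gram(rbf, X)), is_psd(gram(neg_sq_dist, X)))
```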

22 How to Build New Kernels. Kernel combinations preserving validity:
$K(x, y) = \lambda K_1(x, y) + (1 - \lambda) K_2(x, y)$, with $0 \le \lambda \le 1$
$K(x, y) = a\, K_1(x, y)$, with $a > 0$
$K(x, y) = K_1(x, y)\, K_2(x, y)$
$K(x, y) = f(x)\, f(y)$, where $f$ is a real-valued function
$K(x, y) = K_3(\phi(x), \phi(y))$
$K(x, y) = x' P y$, where $P$ is symmetric positive definite
$K(x, y) = \dfrac{K_1(x, y)}{\sqrt{K_1(x, x)\, K_1(y, y)}}$
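Several of these closure rules can be checked on Gram matrices. The following sketch assumes an RBF kernel for $K_1$ and an inhomogeneous degree-2 polynomial kernel for $K_2$. These choices, the random sample, and the names `min_eig` and `combos` are illustrative assumptions.

```python
import numpy as np

def min_eig(K):
    """Smallest eigenvalue of the symmetrised matrix K."""
    return float(np.linalg.eigvalsh((K + K.T) / 2.0).min())

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))  # hypothetical sample: 8 points in R^3

# Pairwise squared distances, then two base Gram matrices.
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K1 = np.exp(-0.5 * sq)        # RBF kernel
K2 = (X @ X.T + 1.0) ** 2     # polynomial kernel (<x, y> + 1)^2

lam = 0.3
d2 = np.diag(K2)
combos = {
    "convex combination": lam * K1 + (1.0 - lam) * K2,
    "positive scaling": 2.5 * K1,
    "product": K1 * K2,                      # element-wise product of Gram matrices
    "normalisation": K2 / np.sqrt(np.outer(d2, d2)),
}
for name, K in combos.items():
    print(name, min_eig(K) > -1e-6)
```

Each combination keeps the smallest eigenvalue non-negative up to rounding, consistent with the rules on the slide.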

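23 Introduction to the Primal and Dual Forms of the Learning Function

24 Linear Regression. Given training data $S = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_\ell, y_\ell)\}$ of points $x_i \in \mathbb{R}^n$ and labels $y_i \in \mathbb{R}$, construct a linear function that best interpolates the training set: $g(x) = \langle w, x \rangle = w'x = \sum_{j=1}^{n} w_j [x]_j$, where $[x]_j$ is the $j$-th component of $x$.

25 Linear Regression Example: $y = \langle w, x \rangle$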

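26 Ridge Regression. With the training points as the rows of $X$, the inverse of $X'X$ typically does not exist, so use the least-norm solution for a fixed $\lambda > 0$. Regularized problem: $\min_w L_\lambda(w, S) = \lambda \|w\|^2 + \|y - Xw\|^2$. Optimality condition: $\frac{\partial L_\lambda(w, S)}{\partial w} = 2\lambda w - 2X'y + 2X'Xw = 0 \;\Rightarrow\; (X'X + \lambda I_n)\, w = X'y$. Solving this system requires $O(n^3)$ operations.

27 Ridge Regression (cont'd). The inverse always exists for any $\lambda > 0$: $w = (X'X + \lambda I)^{-1} X'y$. Alternative representation: $(X'X + \lambda I)\, w = X'y \;\Rightarrow\; w = \lambda^{-1}(X'y - X'Xw) = \lambda^{-1} X'(y - Xw) = X'\alpha$, with $\alpha = \lambda^{-1}(y - Xw)$. Then $\lambda\alpha = y - Xw = y - XX'\alpha \;\Rightarrow\; (XX' + \lambda I)\,\alpha = y \;\Rightarrow\; \alpha = (G + \lambda I)^{-1} y$, where $G = XX'$. Solving this $\ell \times \ell$ system is $O(\ell^3)$.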

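28 Dual Ridge Regression. To predict a new point $x$: $g(x) = \langle w, x \rangle = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle = y'(G + \lambda I)^{-1} z$, where $z_i = \langle x_i, x \rangle$. Note that we need only compute the Gram matrix $G = XX'$, $G_{ij} = \langle x_i, x_j \rangle$. Ridge regression requires only inner products between data points.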
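The equivalence of the primal and dual solutions can be confirmed numerically. This is a minimal sketch on synthetic data; the sizes `ell` and `n`, the value of `lam`, and the variable names are assumptions, with the training points stored as the rows of `X`.

```python
import numpy as np

rng = np.random.default_rng(0)
ell, n, lam = 20, 5, 0.1                  # number of points, dimension, lambda
X = rng.normal(size=(ell, n))             # rows are the training points x_i
y = X @ rng.normal(size=n) + 0.01 * rng.normal(size=ell)

# Primal: solve the n x n system (X'X + lambda I) w = X'y.
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual: solve the ell x ell system (G + lambda I) alpha = y with G = XX'.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(ell), y)
w_from_dual = X.T @ alpha                 # w = X' alpha

# Prediction at a new point through both routes.
x_new = rng.normal(size=n)
g_primal = w @ x_new
z = X @ x_new                             # z_i = <x_i, x_new>
g_dual = alpha @ z
print(g_primal, g_dual)
```

Both routes give the same weight vector and the same prediction; only the dual route is expressed purely through inner products.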

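29 Efficiency. Computing $w$ in primal ridge regression is $O(n^3)$; computing $\alpha$ in dual ridge regression is $O(\ell^3)$. To predict a new point $x$: in the primal, $g(x) = \langle w, x \rangle = \sum_{j=1}^{n} w_j [x]_j$ costs $O(n)$; in the dual, $g(x) = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle = \sum_{i=1}^{\ell} \alpha_i \sum_{j=1}^{n} [x_i]_j [x]_j$ costs $O(n\ell)$. The dual is better if $n \gg \ell$.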

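30 Key Ingredient of the Dual Solution. Step 1: compute $\alpha = (G + \lambda I)^{-1} y$, where $G = XX'$, $G_{ij} = \langle x_i, x_j \rangle$. Step 2: evaluate $g(x) = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x \rangle$. Important observation: both steps only involve inner products between input data points.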
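Because both steps touch the data only through inner products, each inner product can be replaced by a kernel evaluation. The sketch below does this with a Gaussian RBF kernel; the kernel choice, the function names `rbf_gram`, `fit` and `predict`, and the sine-curve toy data are assumptions for illustration.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """Matrix of exp(-gamma * ||a - b||^2) for rows a of A and rows b of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit(X, y, lam, gamma):
    """Step 1: alpha = (G + lambda I)^{-1} y with a kernel Gram matrix G."""
    G = rbf_gram(X, X, gamma)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma):
    """Step 2: g(x) = sum_i alpha_i k(x_i, x) for each row x of X_new."""
    return rbf_gram(X_new, X_train, gamma) @ alpha

# Hypothetical 1-D regression problem: noiseless samples of sin(x).
X = np.linspace(0.0, 2.0 * np.pi, 40)[:, None]
y = np.sin(X[:, 0])
alpha = fit(X, y, lam=1e-3, gamma=1.0)

X_test = np.array([[1.0], [2.5], [4.0]])
pred = predict(X, alpha, X_test, gamma=1.0)
print(pred, np.sin(X_test[:, 0]))
```

The resulting predictor is non-linear in the input, although the algorithm is the same linear ridge regression carried out in the feature space.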

31 Kernel Trick: Summary. Any learning algorithm that depends only on dot products can benefit from the kernel trick. This way, we can apply linear methods to vectorial as well as non-vectorial data. Think of kernels as non-linear similarity measures. Examples of common kernels: the polynomial and RBF kernels of slide 10.
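For reference, widely used forms of the polynomial and RBF kernels can be written as below. The parameter names $c$, $d$ and $\sigma$ are assumptions, since the slides do not fix a parametrisation.

```latex
\begin{align*}
  \text{Polynomial:}   \quad & k(x, y) = \bigl(\langle x, y \rangle + c\bigr)^{d},
      \qquad c \ge 0,\ d \in \mathbb{N} \\
  \text{Gaussian RBF:} \quad & k(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right),
      \qquad \sigma > 0
\end{align*}
```

Setting $c = 0$ and $d = 2$ gives the kernel $\langle x, y \rangle^2$ of slide 18.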

32 Kernel Method Application I: Sparse Kernels (SVM)

33 Sparse Kernel Methods: Support Vector Classifier

34 Separating Hyperplane

35 Optimal Separating Hyperplane

36 Eliminating the Scaling Freedom

37 Canonical Optimal Hyperplane

38 Formulation as an Optimization Problem
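As a sketch in the notation of the earlier slides, the standard hard-margin formulation for a canonical hyperplane is given below. The symbols $w$, $b$, $\Phi$ and the sample size $m$ are assumed to carry over from slides 13 to 15; the slide's own equations may have been written differently.

```latex
\begin{align*}
  \min_{w,\, b} \quad & \tfrac{1}{2}\lVert w \rVert^{2} \\
  \text{subject to} \quad & y_i \bigl(\langle w, \Phi(x_i) \rangle + b\bigr) \ge 1,
      \qquad i = 1, \ldots, m
\end{align*}
```

Minimising $\lVert w \rVert$ under these constraints maximises the margin $1 / \lVert w \rVert$ between the hyperplane and the closest training points.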

39 Lagrange Function

40 Derivation of the Dual Problem

41 The Support Vector Expansion

42 Dual Problem
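The steps named on slides 39 to 42 follow the standard derivation for the hard-margin problem, sketched here under the assumption that the multipliers are called $\alpha_i$ and the kernel is $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ as on slide 14.

```latex
\begin{align*}
  & L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^{2}
      - \sum_{i=1}^{m} \alpha_i \Bigl( y_i \bigl(\langle w, \Phi(x_i) \rangle + b\bigr) - 1 \Bigr),
      \qquad \alpha_i \ge 0 \\[4pt]
  & \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0,
    \qquad
    \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i \Phi(x_i) \\[4pt]
  & \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i
      - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)
    \quad \text{subject to} \quad \alpha_i \ge 0, \ \sum_{i=1}^{m} \alpha_i y_i = 0 \\[4pt]
  & f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{m} \alpha_i y_i \, k(x, x_i) + b \Bigr)
\end{align*}
```

Only the points with $\alpha_i > 0$ (the support vectors) appear in the expansion of $w$, which is the source of the sparsity referred to on slide 33.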

43 Example: Gaussian Kernel

44 Non-Separable Data

45 Soft Margin SVMs

46 The ν-Property

47 Duals Using Kernels

48 SVM Training

49 Kernel Method Application II: Anomaly Detection Using Kernel Methods

50 One-Class SVM. Two approaches: finding the smallest hypersphere containing most of the training data (Y. Chen et al., 2001); and mapping the data into feature space, then finding the maximum-margin hyperplane separating the mapped data from the origin (Schölkopf et al., 1999).

51 Chen's Method

52 Schölkopf's Method: Separating Unlabeled Data from the Origin

53 ν-Soft Margin Separation
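For orientation, the standard ν-soft-margin formulation of separating the mapped data from the origin, together with its dual, is sketched below. The symbols $\rho$, $\xi_i$, $\alpha_i$ and the sample size $m$ are assumptions about notation, not taken from the slide.

```latex
\begin{align*}
  \text{Primal:} \quad
  & \min_{w,\, \xi,\, \rho} \ \tfrac{1}{2}\lVert w \rVert^{2}
      + \frac{1}{\nu m} \sum_{i=1}^{m} \xi_i - \rho
    \quad \text{subject to} \quad
    \langle w, \Phi(x_i) \rangle \ge \rho - \xi_i, \ \ \xi_i \ge 0 \\[4pt]
  \text{Dual:} \quad
  & \min_{\alpha} \ \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, k(x_i, x_j)
    \quad \text{subject to} \quad
    0 \le \alpha_i \le \frac{1}{\nu m}, \ \ \sum_{i=1}^{m} \alpha_i = 1 \\[4pt]
  \text{Decision:} \quad
  & f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{m} \alpha_i \, k(x_i, x) - \rho \Bigr)
\end{align*}
```

Points with $f(x) = -1$ fall on the origin side of the hyperplane and are flagged as anomalies.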

54 Dual Problem

55 Remark. These two methods are equivalent if the kernel $K(x, y)$ depends only on $x - y$ (Schölkopf et al.).

56 Conclusion. Kernel methods provide a general-purpose toolkit for pattern analysis. Advantages: kernels define a flexible interface to the data, enabling the user to encode prior knowledge into a measure of similarity between two items, with the proviso that it must satisfy the PSD property; algorithms well founded in statistical learning theory enable efficient and effective exploitation of the high-dimensional representations, giving good off-training-set performance; subspace methods can often be implemented in kernel-defined feature spaces using dual representations; overall, this gives a generic plug-and-play framework for analysing data, combining different data types, models, tasks, and preprocessing; different kinds of data can be accommodated in learning problems; the optimization problem is convex, so every local minimum is a global optimum. Disadvantages: the choice of kernel is somewhat heuristic and depends on the application; there is a risk of overfitting.

57 One whose heart has been brought to life by love never dies; our permanence is recorded in the register of the world.

58 Thank you!
