Multiple Kernel Learning

1 CS 678A Course Project. Vivek Gupta (1), Anurendra Kumar (2). Supervisor: Prof. Harish Karnick (1). (1) Department of Computer Science and Engineering, (2) Department of Electrical Engineering, Indian Institute of Technology, Kanpur

2 Outline: 1. Introduction (Motivation). 2. Binary Classification (Dual Problem for M.K.L., Saddle Point Problem, Semi-Infinite Linear Program).

4 Motivation: Why do we need multiple kernel learning? Automatic model selection: applying an SVM requires choosing a kernel, which is a non-intuitive problem. Multimodal data: data often come from heterogeneous sources. For example, a video with subtitles contains video, audio, and text features, and each set of features requires a different notion of similarity.

5 Instead of using a single kernel, MKL uses a convex combination of K kernels, i.e.
$$k(x_i, x_j) = \sum_{k=1}^{K} \beta_k \, k_k(x_i, x_j) \qquad (1)$$
with $\beta_k \ge 0$ and $\sum_{k=1}^{K} \beta_k = 1$, where each kernel $k_k$ requires only a subset of the features. If we choose appropriate kernels $k_k$ and find a sparse weighting $\beta_k$, the resulting decision function is easy to interpret and feature selection can easily be implemented, which is missing in current kernel-based algorithms.
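The convex combination in equation (1) can be sketched in a few lines of Python. This is only an illustration: the function names `gaussian_kernel` and `combined_kernel` are hypothetical, and the Gaussian kernel is assumed to be parameterised as exp(-||x - z||^2 / width), since the slides do not state which width convention they use.

```python
import math

def gaussian_kernel(x, z, width):
    # Assumed parameterisation: exp(-||x - z||^2 / width).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / width)

def combined_kernel(x, z, widths, beta):
    # Equation (1): k(x, z) = sum_k beta_k * k_k(x, z),
    # with beta_k >= 0 and sum_k beta_k = 1.
    if len(widths) != len(beta):
        raise ValueError("need one weight per kernel")
    if any(b < 0 for b in beta) or abs(sum(beta) - 1.0) > 1e-9:
        raise ValueError("beta must be non-negative and sum to 1")
    return sum(b * gaussian_kernel(x, z, w) for b, w in zip(beta, widths))

# Two Gaussian kernels of width 2 and 10, weighted equally.
value = combined_kernel([0.0, 0.0], [1.0, 0.0], widths=[2.0, 10.0], beta=[0.5, 0.5])
```

Because the weights form a convex combination, the combined kernel of a point with itself is still 1 when every base kernel is Gaussian.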

7 Primal Problem: We are given N data points $(x_i, y_i)$ with $y_i \in \{\pm 1\}$ and K mappings $\phi_k(x) \in \mathbb{R}^{D_k}$ from the input into K feature spaces $\phi_1(x), \phi_2(x), \dots, \phi_K(x)$, where $D_k$ is the dimensionality of the k-th feature space. The primal problem is
$$\min \ \frac{1}{2}\Big(\sum_{k=1}^{K} \|w_k\|_2\Big)^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{w.r.t. } w_k \in \mathbb{R}^{D_k},\ \xi \in \mathbb{R}^{N},\ b \in \mathbb{R}$$
$$\text{s.t. } \xi_i \ge 0 \ \text{ and } \ y_i\Big(\sum_{k=1}^{K} \langle w_k, \phi_k(x_i) \rangle + b\Big) \ge 1 - \xi_i, \quad i = 1, 2, \dots, N.$$
Bach showed that the solution can be written as $w_k = \beta_k w'_k$ with $\beta_k \ge 0$ and $\sum_{k=1}^{K} \beta_k = 1$. The solution for $\beta$ is sparse ($\ell_1$ norm), while each $w_k$ is not sparse ($\ell_2$ norm).
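As a rough illustration of the primal objective, the sketch below evaluates the mixed-norm regulariser and the hinge slacks for given weight blocks. It assumes the feature maps have already been evaluated, that the bias b enters the decision function additively as in a standard SVM, and that each slack takes its smallest feasible value; the function name `mkl_primal_objective` and the toy numbers are hypothetical.

```python
import math

def mkl_primal_objective(w_blocks, b, phi_blocks, y, C):
    # w_blocks[k]      : weight vector w_k for feature space k
    # phi_blocks[k][i] : phi_k(x_i), the k-th feature map of point i
    # Regulariser: 0.5 * (sum_k ||w_k||_2)^2, an l1 norm over block l2 norms.
    block_norms = [math.sqrt(sum(v * v for v in w)) for w in w_blocks]
    regulariser = 0.5 * sum(block_norms) ** 2

    slacks = []
    for i, y_i in enumerate(y):
        score = b + sum(
            sum(w_j * p_j for w_j, p_j in zip(w, phi[i]))
            for w, phi in zip(w_blocks, phi_blocks)
        )
        # Smallest xi_i satisfying xi_i >= 0 and y_i * score >= 1 - xi_i.
        slacks.append(max(0.0, 1.0 - y_i * score))
    return regulariser + C * sum(slacks), slacks

objective, slacks = mkl_primal_objective(
    w_blocks=[[3.0, 4.0], [0.0]],
    b=0.0,
    phi_blocks=[[[1.0, 0.0], [0.0, 0.0]], [[1.0], [1.0]]],
    y=[1, -1],
    C=1.0,
)
```

In this toy case the second block has zero norm, which is the kind of block-level sparsity the l1 norm over blocks encourages.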

9 Dual Problem for M.K.L.: Formulating the dual and using the epigraph technique, we can write
$$\min \ \gamma \quad \text{w.r.t. } \gamma \in \mathbb{R},\ \alpha \in \mathbb{R}^{N}$$
$$\text{s.t. } 0 \le \alpha \le \mathbf{1}C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0,$$
$$S_k(\alpha) := \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, k_k(x_i, x_j) - \sum_{i=1}^{N} \alpha_i \le \gamma, \quad k = 1, \dots, K.$$
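To make the quantity $S_k(\alpha)$ concrete, here is a minimal sketch that evaluates it from a precomputed Gram matrix and then takes the smallest feasible $\gamma$ for a fixed $\alpha$, namely the maximum over kernels. The helper names `s_k` and `smallest_gamma` and the 2x2 toy matrices are assumptions made for illustration.

```python
def s_k(alpha, y, gram):
    # S_k(alpha) = 0.5 * sum_ij alpha_i alpha_j y_i y_j k_k(x_i, x_j) - sum_i alpha_i
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * gram[i][j]
        for i in range(n)
        for j in range(n)
    )
    return 0.5 * quad - sum(alpha)

def smallest_gamma(alpha, y, grams):
    # For fixed alpha, the constraints S_k(alpha) <= gamma for all k
    # are tightest at gamma = max_k S_k(alpha).
    return max(s_k(alpha, y, g) for g in grams)

alpha = [1.0, 1.0]
y = [1, -1]
gram_identity = [[1.0, 0.0], [0.0, 1.0]]
gram_ones = [[1.0, 1.0], [1.0, 1.0]]
gamma = smallest_gamma(alpha, y, [gram_identity, gram_ones])
```

Note that this `alpha` satisfies the equality constraint $\sum_i \alpha_i y_i = 0$; the sketch does not check feasibility itself.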

11 Saddle Point Problem: The above problem is equivalent to the following saddle-point problem:
$$\max_{\beta} \ \min_{\alpha, \gamma} \ L := \gamma + \sum_{k=1}^{K} \beta_k \big(S_k(\alpha) - \gamma\big)$$
$$\text{s.t. } 0 \le \alpha \le \mathbf{1}C, \quad \beta \ge 0 \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0.$$

12 Saddle Point Problem: Setting the derivative of L with respect to $\gamma$ to zero gives $\sum_{k=1}^{K} \beta_k = 1$. Substituting this constraint, we get the following simplified problem:
$$\max_{\beta} \ \min_{\alpha} \ \sum_{k=1}^{K} \beta_k S_k(\alpha)$$
$$\text{s.t. } 0 \le \alpha \le \mathbf{1}C, \quad 0 \le \beta, \quad \sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad \sum_{k=1}^{K} \beta_k = 1.$$

14 Semi-Infinite Linear Program: Again using the epigraph technique, this time with respect to $\alpha$, we get the following SILP:
$$\max \ \theta \quad \text{w.r.t. } \theta \in \mathbb{R},\ \beta \in \mathbb{R}^{K}$$
$$\text{s.t. } 0 \le \beta, \quad \sum_{k=1}^{K} \beta_k = 1, \quad \sum_{k=1}^{K} \beta_k S_k(\alpha) \ge \theta$$
$$\text{for all } \alpha \in \mathbb{R}^{N} \text{ with } 0 \le \alpha \le \mathbf{1}C \text{ and } \sum_{i=1}^{N} \alpha_i y_i = 0.$$
This is a linear program in $\theta$ and $\beta$ with infinitely many constraints, one for each $\alpha$. SILP algorithms such as exchange methods and the wrapper algorithm are used to solve it.
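Exchange-type methods work with a finite subset of the constraints at a time. As a hedged sketch of that restricted linear program only (not of the full wrapper algorithm, whose inner step needs an SVM solver), the code below handles the special case K = 2, where $\beta = (\beta_1, 1 - \beta_1)$ and the optimum of a max-min of lines lies at an endpoint or at an intersection of two constraint lines. The function name `solve_master_two_kernels` and the input values are hypothetical.

```python
def solve_master_two_kernels(constraints):
    # constraints: list of (S_1(alpha_t), S_2(alpha_t)) pairs, one per stored alpha_t.
    # Maximise theta subject to beta1*S_1 + (1 - beta1)*S_2 >= theta for every pair,
    # with 0 <= beta1 <= 1.
    def theta_at(beta1):
        return min(beta1 * s1 + (1.0 - beta1) * s2 for s1, s2 in constraints)

    candidates = [0.0, 1.0]
    for t, (p1, p2) in enumerate(constraints):
        for q1, q2 in constraints[t + 1:]:
            # Lines p2 + beta1*(p1 - p2) and q2 + beta1*(q1 - q2) intersect here.
            denom = (p1 - p2) - (q1 - q2)
            if abs(denom) > 1e-12:
                beta1 = (q2 - p2) / denom
                if 0.0 <= beta1 <= 1.0:
                    candidates.append(beta1)

    best = max(candidates, key=theta_at)
    return best, theta_at(best)

# Two stored alphas: one favours kernel 1, the other favours kernel 2.
beta1, theta = solve_master_two_kernels([(-1.0, -3.0), (-3.0, -1.0)])
```

In a full exchange scheme, each new $\alpha$ found by minimising $\sum_k \beta_k S_k(\alpha)$ for the current $\beta$ would be appended to `constraints` before solving again; for general K, a standard LP solver would replace the enumeration used here.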

15 Machine Learning Toolbox: A free, open-source toolbox originally designed for large-scale kernel methods and bioinformatics. It offers a large number of kernels, including string kernels. It is modular and optimized for a very large number of examples and for combining hundreds of kernels. It allows easy combination of multiple data representations, algorithm classes, and general-purpose tools. It was originally written in C++, but a unified interface is available for C++, Python, Octave, R, Java, Lua, C, and Matlab. Algorithms: HMM, LDA, LPM, Perceptron, SVR, and many more.

17 Gaussian kernels of width 2, 5, 7, and 10 are used. Figure: Binary Classification

18 Kernel weights with varying width

19 Error with varying width between circles

21 Dataset: four Gaussians with different means and different covariance matrices. Kernels: two Gaussian kernels with different widths.

23 Performance Comparison. Kernels used: Gaussian (width = 0.25), Gaussian (width = 25). Figure: Decision boundaries for MKL, Gaussian (0.25), and Gaussian (25), from left to right. Result columns: MKL, Gaussian(0.25), Gaussian(25); only one value, 89.43, is recoverable.

25 Figure: Two Gaussian datasets approaching each other and then drifting apart

26 Kernel Weight Comparison. Kernels used: Gaussian (width = 0.5) and Gaussian (width = 200)

27 Error Comparison. Kernels used: Gaussian (width = 0.5) and Gaussian (width = 200)

29 on Different Datasets Figure: Closely Spaced Concentric Circles

30 Figure: Far Spaced Concentric Circles

31 Figure: Moon

32 Figure: Noisy(5%) Blobs

33 Figure: Linear Dataset

34 Figure: Moon with high noise(40%)

35 Figure: Circles with high noise(40%)

36 Figure: Linear Separable with High noise

37 Kernel Weight Comparison. Multi-kernel learning weights the kernels differently for each dataset and thereby automatically selects a good model. Kernels: Gaussian (width = 1), Polynomial (degree = 4), Sigmoid, Linear.
Dataset1 9.99e e e-10
Dataset2 2.77e e e e-07
Dataset3 8.77e e e e-01
Dataset4 8.08e e e e-01
Dataset5 9.85e e e e-07
Dataset6 7.18e e e e-01
Dataset7 9.99e e e e-08
Dataset8 8.44e e e e-01

39 Data Coming from Heterogeneous Sources. Dataset from Linear and Radial Sources

40 Data Coming from Heterogeneous Sources. Figure: One source is linear and the other is Gaussian

42 Regression on sine with varying frequency

44 Classification on Real Datasets. Dataset 1: USPS handwritten digit data. Description: 4650 training and test examples with ten classes, each corresponding to one digit. Kernels used: polynomial (degree = 2), Gaussian (width = 15). Result: MKL 94.9, Gaussian 93.1, Polynomial 93.5

45 Classification on Real Datasets. Dataset 2: Ionosphere dataset from the UCI repository. Description: 280 training and 71 test examples (10-fold). Kernels used: polynomial (degree = 2), Gaussian (width = 15). Result: MKL 94.1, Gaussian 92.1, Polynomial 91

46 MKL automatically learns an efficient weighting of the kernels. It gives lower generalization error than any of the individual kernels, independent of data distribution and separability. It also learns efficiently in the presence of outliers and noisy data. It can learn from data coming from different heterogeneous sources, i.e. Multimodal S

47 Apply MKL to real multimodal datasets, such as video with audio and subtitles. Experiment with non-convex and non-linear combinations of kernels.

48 fbach/skm i cml.pdf http : // toolbox.org/

49 Thank You!
