CS 678A Course Project
Vivek Gupta (1), Anurendra Kumar (2)
Supervisor: Prof. Harish Karnick (1)
(1) Department of Computer Science and Engineering
(2) Department of Electrical Engineering
Indian Institute of Technology, Kanpur
Outline
1. Introduction: Motivation
2. Binary Classification: Dual Problem for M.K.L.; Saddle Point Problem; Semi-Infinite Linear Program
Motivation
Why do we need M.K.L.?
Automatic model selection: applying an SVM requires choosing a kernel, a non-intuitive problem.
Multimodal data: data often come from heterogeneous sources; consider, e.g., a video with subtitles. It contains video features, audio features, and text features, and each set of features requires a different notion of similarity.
Instead of using a single kernel, use a convex combination of K kernels, i.e.

k(x_i, x_j) = Σ_{k=1}^{K} β_k k_k(x_i, x_j)    (1)

with β_k ≥ 0 and Σ_{k=1}^{K} β_k = 1, where each kernel k_k may require only a subset of the features. If we choose appropriate kernels k_k and find a sparse weighting β_k, the decision function and feature selection can be implemented together, which is missing in current kernel-based algorithms.
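As a minimal sketch (assuming NumPy and plain Gaussian kernels, not the Shogun implementation), the convex combination of eq. (1) can be computed as:

```python
import numpy as np

def gaussian_kernel(X, Y, width):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / width)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / width)

def combined_kernel(X, Y, widths, betas):
    """Convex combination of eq. (1): k = sum_k beta_k * k_k,
    with beta_k >= 0 and sum_k beta_k = 1."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and np.isclose(betas.sum(), 1.0)
    return sum(b * gaussian_kernel(X, Y, w) for b, w in zip(betas, widths))

X = np.random.default_rng(0).normal(size=(5, 2))
K = combined_kernel(X, X, widths=[2, 5, 7, 10], betas=[0.4, 0.3, 0.2, 0.1])
```

Any fixed β on the simplex yields a valid kernel matrix; M.K.L. learns which β to use.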
Primal Problem
We are given N data points (x_i, y_i) with y_i ∈ {±1}, and K mappings φ_k(x) ∈ R^{D_k} from the input into K feature spaces φ_1(x), φ_2(x), ..., φ_K(x), where D_k is the dimensionality of the k-th feature space. The primal problem is

min  (1/2) (Σ_{k=1}^{K} ||w_k||)^2 + C Σ_{i=1}^{N} ξ_i
w.r.t.  w_k ∈ R^{D_k}, ξ ∈ R^N, b ∈ R
s.t.  ξ_i ≥ 0 and y_i (Σ_{k=1}^{K} ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i,  i = 1, 2, ..., N.

Bach showed that the solution can be written as w_k = β_k w'_k with β_k ≥ 0 and Σ_{k=1}^{K} β_k = 1. The solution for β is sparse (l_1 norm) while w' is not sparse (l_2 norm).
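Since w_k = β_k w'_k, the learned decision function f(x) = Σ_k ⟨w_k, φ_k(x)⟩ + b can be evaluated purely through kernels as f(x) = Σ_i α_i y_i Σ_k β_k k_k(x_i, x) + b. A small sketch under that assumption (function names and the `kernels` callables are illustrative, not Shogun's API):

```python
import numpy as np

def mkl_decision_function(x, X_train, y, alpha, bias, betas, kernels):
    """f(x) = sum_i alpha_i y_i sum_k beta_k k_k(x_i, x) + b, where each
    kernels[k] maps (X_train, x) to the vector [k_k(x_i, x)]_i."""
    combined = sum(b_k * k(X_train, x) for b_k, k in zip(betas, kernels))
    return float((alpha * y) @ combined + bias)

# Toy check with a single linear kernel: f reduces to a plain SVM function.
linear = lambda X, x: X @ x
f = mkl_decision_function(np.array([2.0]),
                          X_train=np.array([[1.0], [-1.0]]),
                          y=np.array([1.0, -1.0]),
                          alpha=np.array([1.0, 1.0]),
                          bias=0.0,
                          betas=[1.0],
                          kernels=[linear])
```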
Dual Problem for M.K.L.
Formulating the dual and using the epigraph technique, we can write

min  γ
w.r.t.  γ ∈ R, α ∈ R^N
s.t.  0 ≤ α ≤ C·1,  Σ_{i=1}^{N} α_i y_i = 0 and
S_k(α) := (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j k_k(x_i, x_j) − Σ_{i=1}^{N} α_i ≤ γ,  k = 1, ..., K.
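The quantity S_k(α) is cheap to evaluate once the kernel matrix K_k is precomputed; a minimal sketch (NumPy assumed):

```python
import numpy as np

def S(alpha, y, K_k):
    """S_k(alpha) = 1/2 sum_ij alpha_i alpha_j y_i y_j k_k(x_i, x_j) - sum_i alpha_i."""
    ay = alpha * y                     # elementwise alpha_i * y_i
    return 0.5 * ay @ K_k @ ay - alpha.sum()

alpha = np.array([0.5, 0.5])
y = np.array([1.0, -1.0])
val = S(alpha, y, np.eye(2))           # identity kernel matrix as a toy check
```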
Saddle Point Problem
The above problem is equivalent to the following saddle point problem:

max_β min_{α,γ}  L := γ + Σ_{k=1}^{K} β_k (S_k(α) − γ)
s.t.  0 ≤ α ≤ C·1, 0 ≤ β and Σ_{i=1}^{N} α_i y_i = 0.
Saddle Point Problem
Setting the derivative of L with respect to γ to zero gives Σ_{k=1}^{K} β_k = 1; substituting this back, we get the following simplified problem:

max_β min_α  Σ_{k=1}^{K} β_k S_k(α)
s.t.  0 ≤ α ≤ C·1, 0 ≤ β, Σ_{i=1}^{N} α_i y_i = 0 and Σ_{k=1}^{K} β_k = 1.
Semi-Infinite Linear Program
Using the epigraph technique again, now w.r.t. α, we get the following SILP:

max  θ
w.r.t.  θ ∈ R, β ∈ R^K
s.t.  0 ≤ β, Σ_{k=1}^{K} β_k = 1 and
Σ_{k=1}^{K} β_k S_k(α) ≥ θ  for all α ∈ R^N with 0 ≤ α ≤ C·1 and Σ_{i=1}^{N} α_i y_i = 0.

This is a linear program in θ and β with infinitely many constraints, one for each feasible α. SILP algorithms such as exchange methods and the wrapper algorithm are used to solve it.
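The wrapper/exchange method alternates between solving a single-kernel SVM with the current combined kernel to generate a new constraint α^(t), and solving the restricted master LP over the constraints collected so far. The master step can be sketched with `scipy.optimize.linprog`; the function name and the (T × K) matrix of stored S_k(α^(t)) values are assumptions of this sketch, not part of the paper's code:

```python
import numpy as np
from scipy.optimize import linprog

def restricted_master(S_vals):
    """Solve  max theta  s.t.  sum_k beta_k S[t,k] >= theta  for every stored
    alpha^(t), with beta >= 0, sum_k beta_k = 1.  Variables: [theta, beta]."""
    S = np.atleast_2d(np.asarray(S_vals, dtype=float))   # shape (T, K)
    T, K = S.shape
    c = np.zeros(K + 1)
    c[0] = -1.0                                          # linprog minimizes -theta
    A_ub = np.hstack([np.ones((T, 1)), -S])              # theta - S[t] @ beta <= 0
    b_ub = np.zeros(T)
    A_eq = np.zeros((1, K + 1))
    A_eq[0, 1:] = 1.0                                    # simplex: sum_k beta_k = 1
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * K
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]                           # theta, beta

theta, beta = restricted_master([[-1.0, -3.0], [-2.0, -0.5]])
```

Iterating until θ stops improving (within a tolerance ε) recovers the SILP solution.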
Shogun: A Machine Learning Toolbox
A free, open-source toolbox originally designed for large-scale kernel methods and bioinformatics.
Large number of kernels, including string kernels.
Modular and optimized for a very large number of examples and hundreds of kernels to be combined.
Allows easy combination of multiple data representations, algorithm classes, and general-purpose tools.
Originally written in C++, but a unified interface is available for C++, Python, Octave, R, Java, Lua, C, and Matlab.
Algorithms: HMM, LDA, LPM, Perceptron, SVR, and many more.
Gaussian kernels of widths 2, 5, 7, and 10 are used.
Figure: Binary Classification
Kernel weights with varying width
Error with varying width between circles
Datasets: four Gaussians with different means and different covariance matrices.
Kernels: two Gaussians with different widths.
Performance Comparison
Kernels used: Gaussian(width=0.25), Gaussian(width=25)
Figure: Decision boundaries with varying kernel: M.K.L., Gaussian(0.25), Gaussian(25) (from left)
Result (%):
MKL             92.26
Gaussian(0.25)  87.40
Gaussian(25)    89.43
Figure: Two Gaussian clusters approaching and then drifting away
Kernel Weight Comparison
Kernels used: Gaussian(width=0.5) and Gaussian(width=200)
Error Comparison
Kernels used: Gaussian(width=0.5) and Gaussian(width=200)
M.K.L. on Different Datasets
Figure: Closely Spaced Concentric Circles
Figure: Far Spaced Concentric Circles
Figure: Moon
Figure: Noisy(5%) Blobs
Figure: Linear Dataset
Figure: Moon with high noise (40%)
Figure: Circles with high noise (40%)
Figure: Linearly separable dataset with high noise
Kernel Weight Comparison
Multiple kernel learning weights the kernels differently for each dataset, automatically selecting a good model.
Kernels: Gaussian(width=1), Polynomial(degree=4), Sigmoid, Linear

          Gaussian(1)  Polynomial(4)  Sigmoid   Linear
Dataset1  9.99e-01     2.09e-08       1.02e-10  1.05e-10
Dataset2  2.77e-03     9.97e-01       8.75e-07  9.10e-07
Dataset3  8.77e-01     8.27e-05       1.95e-06  1.23e-01
Dataset4  8.08e-01     9.56e-05       1.31e-07  1.91e-01
Dataset5  9.85e-04     9.99e-01       1.40e-07  6.13e-07
Dataset6  7.18e-01     1.79e-01       6.36e-08  1.02e-01
Dataset7  9.99e-01     2.33e-06       1.79e-08  3.10e-08
Dataset8  8.44e-01     1.61e-07       6.63e-09  1.55e-01
Data Coming from Heterogeneous Sources
Dataset from linear and radial sources
Figure: One source is linear and the other is Gaussian
Regression on sine with varying frequency
Classification on Real Datasets
Dataset 1: USPS handwritten digit data
Description: 4650 training and test examples, with ten classes each corresponding to one digit
Kernels used: Polynomial(degree=2), Gaussian(width=15)
Result (%):
MKL         94.9
Gaussian    93.1
Polynomial  93.5
Dataset 2: Ionosphere dataset from the UCI repository
Description: 280 training and 71 test examples (10-fold)
Kernels used: Polynomial(degree=2), Gaussian(width=15)
Result (%):
MKL         94.1
Gaussian    92.1
Polynomial  91.0
Conclusion
M.K.L. automatically learns an efficient weighted distribution of kernels.
It gives lower generalization error than any of the individual kernels, independent of the data distribution and separability.
It also learns efficiently in the presence of outliers and noisy data.
It can learn from data coming from heterogeneous sources, i.e. multimodal data.
Future Work
Apply M.K.L. to real multimodal datasets such as video with audio and subtitles.
Experiment with non-convex and non-linear combinations of kernels.
References
http://www.jmlr.org/papers/volume7/sonnenburg06a/sonnenburg06a.pdf
http://www.di.ens.fr/~fbach/skm_icml.pdf
http://www.shogun-toolbox.org/
http://www.shogun-toolbox.org/doc/en/3.0.0/index.html
http://www.gaussianprocess.org/gpml/data/
https://archive.ics.uci.edu/ml/datasets/ionosphere
Thank You!