Multiple Kernel Learning

CS 678A Course Project
Vivek Gupta (Department of Computer Science and Engineering) and Anurendra Kumar (Department of Electrical Engineering)
Supervisor: Prof. Harish Karnick
Indian Institute of Technology, Kanpur

Outline
1 Introduction
  Motivation
2 Binary Classification
  Dual Problem for M.K.L.
  Saddle Point Problem
  Semi-Infinite Linear Program

Motivation
Why do we need multiple kernel learning?
Automatic model selection: applying an SVM requires choosing a kernel, which is a non-intuitive problem.
Multimodal data: data often come from heterogeneous sources. Consider, for example, a video with subtitles: it contains video features, audio features, and text features, and each set of features requires a different notion of similarity.

Instead of using a single kernel, we use a convex combination of K kernels,

k(x_i, x_j) = \sum_{k=1}^{K} \beta_k \, k_k(x_i, x_j)    (1)

with \beta_k \geq 0 and \sum_{k=1}^{K} \beta_k = 1, where each kernel k_k uses only a subset of the features. If we choose appropriate kernels k_k and find a sparse weighting \beta_k, the decision function and feature selection can be implemented easily, which is missing in current kernel-based algorithms.
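To make equation (1) concrete, the short sketch below (not from the original slides) builds a combined Gram matrix from three base Gaussian kernels with NumPy and scikit-learn; the toy data, the widths and weights, and the width-to-gamma conversion are all illustrative assumptions.

```python
# Minimal sketch: combined Gram matrix k(x_i, x_j) = sum_k beta_k * k_k(x_i, x_j).
# Data, widths, and weights are illustrative, not values from the slides.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.randn(100, 5)                      # toy data: 100 points in R^5
widths = [2.0, 5.0, 10.0]                        # one Gaussian base kernel per width
grams = [rbf_kernel(X, gamma=1.0 / (2 * w**2)) for w in widths]   # width -> gamma convention

beta = np.array([0.7, 0.2, 0.1])                 # beta_k >= 0, sum_k beta_k = 1
assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)

K_combined = sum(b * K for b, K in zip(beta, grams))   # still a valid (PSD) Gram matrix
```

Because each \beta_k is non-negative and the weights sum to one, K_combined is again a positive semi-definite Gram matrix and can be handed to any standard SVM solver.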

Primal Problem
We are given N data points (x_i, y_i) with y_i \in \{\pm 1\} and K mappings \phi_k(x) \in \mathbb{R}^{D_k} from the input space into K feature spaces \phi_1(x), \phi_2(x), \dots, \phi_K(x), where D_k is the dimensionality of the k-th feature space. The primal problem is

\min \quad \frac{1}{2}\Big(\sum_{k=1}^{K} \|w_k\|\Big)^2 + C \sum_{i=1}^{N} \xi_i
w.r.t. \quad w_k \in \mathbb{R}^{D_k}, \ \xi \in \mathbb{R}^N, \ b \in \mathbb{R}
s.t. \quad \xi_i \geq 0 \ \text{and} \ y_i\Big(\sum_{k=1}^{K} \langle w_k, \phi_k(x_i)\rangle + b\Big) \geq 1 - \xi_i, \quad i = 1, 2, \dots, N.

Bach showed that the solution can be written as w_k = \beta_k w_k' with \beta_k \geq 0 and \sum_{k=1}^{K} \beta_k = 1. The solution for \beta is sparse (\ell_1 norm) and w' is not sparse (\ell_2 norm).
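This block-norm primal can be transcribed almost literally into a convex solver. The sketch below (not the authors' code) uses CVXPY and assumes the feature maps are available explicitly as matrices; an epigraph variable t replaces the squared sum of norms so the problem stays in standard conic form.

```python
# Sketch of the block-norm MKL primal in CVXPY (illustrative assumptions throughout).
import cvxpy as cp
import numpy as np

def mkl_primal(Phis, y, C=1.0):
    """Phis: list of K feature matrices, Phis[k] of shape (N, D_k); y in {-1,+1}^N."""
    N = len(y)
    w = [cp.Variable(P.shape[1]) for P in Phis]   # one weight vector per feature space
    xi = cp.Variable(N, nonneg=True)              # slack variables
    b = cp.Variable()
    t = cp.Variable(nonneg=True)                  # epigraph variable for sum_k ||w_k||
    margins = sum(P @ wk for P, wk in zip(Phis, w)) + b
    constraints = [cp.multiply(y, margins) >= 1 - xi,       # margin constraints
                   sum(cp.norm(wk, 2) for wk in w) <= t]    # t >= sum_k ||w_k||
    prob = cp.Problem(cp.Minimize(0.5 * cp.square(t) + C * cp.sum(xi)), constraints)
    prob.solve()
    return [wk.value for wk in w], b.value
```

Minimizing 0.5 * t^2 with t bounding the sum of block norms is equivalent to the stated objective, since the objective is increasing in t.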

Dual Problem for M.K.L.
Formulating the dual and using the epigraph technique, we can write

\min \quad \gamma
w.r.t. \quad \gamma \in \mathbb{R}, \ \alpha \in \mathbb{R}^N
s.t. \quad 0 \leq \alpha \leq C\mathbf{1}, \quad \sum_{i=1}^{N} \alpha_i y_i = 0,
\quad S_k(\alpha) := \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j k_k(x_i, x_j) - \sum_{i=1}^{N} \alpha_i \leq \gamma, \quad k = 1, \dots, K.
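Once the Gram matrices are precomputed, each S_k(\alpha) is cheap to evaluate. The snippet below is only an illustration of the formula above, with my own variable names, and assumes \alpha already satisfies the box and equality constraints.

```python
# Sketch: S_k(alpha) = 1/2 * sum_ij alpha_i alpha_j y_i y_j k_k(x_i, x_j) - sum_i alpha_i,
# evaluated from a precomputed Gram matrix for each base kernel.
import numpy as np

def S(alpha, y, gram):
    ay = alpha * y                      # element-wise alpha_i * y_i
    return 0.5 * ay @ gram @ ay - alpha.sum()

def all_S(alpha, y, grams):
    return np.array([S(alpha, y, G) for G in grams])
```

For a fixed feasible \alpha the smallest admissible \gamma is \max_k S_k(\alpha), so the dual amounts to minimizing \max_k S_k(\alpha) over \alpha.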

The above problem is equivalent to the following saddle-point problem:

\max_{\beta} \min_{\alpha, \gamma} \quad L = \gamma + \sum_{k=1}^{K} \beta_k \big( S_k(\alpha) - \gamma \big)
s.t. \quad 0 \leq \alpha \leq C\mathbf{1}, \quad \beta_k \geq 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0.

Saddle Point Problem
Setting the derivative of L with respect to \gamma to zero forces \sum_{k=1}^{K} \beta_k = 1; substituting this back, we get the following simplified problem:

\max_{\beta} \min_{\alpha} \quad \sum_{k=1}^{K} \beta_k S_k(\alpha)
s.t. \quad 0 \leq \alpha \leq C\mathbf{1}, \quad 0 \leq \beta, \quad \sum_{i=1}^{N} \alpha_i y_i = 0 \ \text{and} \ \sum_{k=1}^{K} \beta_k = 1.

Semi-Infinite Linear Program
Again using the epigraph technique with respect to \alpha, we get the following SILP:

\max \quad \theta
w.r.t. \quad \theta \in \mathbb{R}, \ \beta \in \mathbb{R}^K
s.t. \quad 0 \leq \beta, \ \sum_{k=1}^{K} \beta_k = 1, \ \text{and for all} \ \alpha \in \mathbb{R}^N \ \text{with} \ 0 \leq \alpha \leq C\mathbf{1} \ \text{and} \ \sum_{i=1}^{N} \alpha_i y_i = 0:
\quad \sum_{k=1}^{K} \beta_k S_k(\alpha) \geq \theta.

This is a linear program in \theta and \beta with infinitely many constraints, one for each feasible \alpha. SILP algorithms such as exchange methods and the wrapper algorithm are used to solve it (a sketch of the wrapper loop follows).
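The slides mention the wrapper algorithm without spelling it out. Below is a minimal sketch of that column-generation loop under my own assumptions: scikit-learn's SVC with a precomputed kernel stands in for the single-kernel SVM solver, SciPy's linprog solves the restricted LP over the constraints collected so far, and the stopping test is a simplified version of the one used by Sonnenburg et al. Names such as mkl_wrapper, grams, and rows are illustrative.

```python
# Sketch of the SILP "wrapper" (column-generation) loop for MKL.
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC

def S_k(alpha, y, gram):
    """S_k(alpha) = 1/2 * sum_ij alpha_i alpha_j y_i y_j k_k(x_i,x_j) - sum_i alpha_i."""
    ay = alpha * y
    return 0.5 * ay @ gram @ ay - alpha.sum()

def mkl_wrapper(grams, y, C=1.0, max_iter=50, tol=1e-4):
    """grams: list of K precomputed N x N Gram matrices; y: labels in {-1, +1}."""
    K_num, N = len(grams), len(y)
    beta = np.full(K_num, 1.0 / K_num)            # start from uniform kernel weights
    rows, theta = [], -np.inf                     # accumulated constraint rows S(alpha^t)
    for _ in range(max_iter):
        # 1) SVM step: solve a standard SVM for the current combined kernel.
        K_comb = sum(b * G for b, G in zip(beta, grams))
        svc = SVC(C=C, kernel="precomputed").fit(K_comb, y)
        alpha = np.zeros(N)
        alpha[svc.support_] = np.abs(svc.dual_coef_.ravel())   # dual_coef_ = y_i * alpha_i
        # 2) Check the SILP constraint for the new alpha; stop if it already holds.
        S = np.array([S_k(alpha, y, G) for G in grams])
        if np.isfinite(theta) and beta @ S >= theta - tol:
            break
        rows.append(S)
        # 3) LP step: max theta s.t. beta @ S(alpha^t) >= theta for every stored alpha^t,
        #    beta >= 0, sum(beta) = 1.  linprog minimizes, so the objective is -theta.
        c = np.append(np.zeros(K_num), -1.0)                  # variables x = (beta, theta)
        A_ub = np.hstack([-np.array(rows), np.ones((len(rows), 1))])
        b_ub = np.zeros(len(rows))
        A_eq = np.append(np.ones(K_num), 0.0)[None, :]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * K_num + [(None, None)])
        beta, theta = res.x[:K_num], res.x[K_num]
    return beta
```

Each iteration adds one constraint row S(\alpha^t), so the LP stays small; the dominant cost remains the sequence of ordinary SVM trainings.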

Shogun Machine Learning Toolbox
A free, open-source toolbox originally designed for large-scale kernel methods and bioinformatics. Large number of kernels, including string kernels. Modular and optimized for a very large number of examples and for hundreds of kernels to be combined. Allows easy combination of multiple data representations, algorithm classes, and general-purpose tools. Originally written in C++, but a unified interface is available for C++, Python, Octave, R, Java, Lua, C, Matlab. Algorithms: HMM, LDA, LPM, Perceptron, SVR, and many more.

Gaussian kernels of widths 2, 5, 7, and 10 are used. Figure: Binary Classification

Kernel weights with varying width

Error with varying width between circles

Datasets: four Gaussians with different means and different covariance matrices. Kernels: two Gaussian kernels with different widths (see the sketch below).
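A hypothetical reconstruction of this toy setup might look as follows; the means, covariances, and class assignment are made up for illustration, while the kernel widths are taken from the next slide.

```python
# Hypothetical toy setup: four Gaussian clusters (two per class) with different means
# and covariances, plus two Gaussian base kernels of different widths.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
means = [(-2, -2), (2, 2), (-2, 2), (2, -2)]
covs = [np.diag([0.3, 1.0]), np.diag([1.0, 0.3]),
        np.diag([0.5, 0.5]), [[1.0, 0.6], [0.6, 1.0]]]
X = np.vstack([rng.multivariate_normal(m, c, size=50) for m, c in zip(means, covs)])
y = np.array([+1] * 100 + [-1] * 100)             # first two clusters = +1, last two = -1

widths = [0.25, 25.0]                             # two Gaussian kernels of different widths
grams = [rbf_kernel(X, gamma=1.0 / (2 * w**2)) for w in widths]
# `grams` and `y` can now be fed to the MKL wrapper sketched earlier.
```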

Performance Comparison
Kernels used: Gaussian(width=0.25), Gaussian(width=25)
Figure: Decision boundaries with varying kernel: M.K.L., Gaussian(0.25), Gaussian(25) (from left)
Result:
MKL            92.26
Gaussian(0.25) 87.40
Gaussian(25)   89.43

Figure: Two Gaussian datasets approaching each other and then drifting apart

Kernel Weight Comparison. Kernels used: Gaussian(width=0.5) and Gaussian(width=200)

Error Comparison. Kernels used: Gaussian(width=0.5) and Gaussian(width=200)

MKL on Different Datasets. Figure: Closely Spaced Concentric Circles

Figure: Far Spaced Concentric Circles

Figure: Moon

Figure: Noisy (5%) Blobs

Figure: Linear Dataset

Figure: Moon with high noise (40%)

Figure: Circles with high noise (40%)

Figure: Linearly Separable with High Noise

Kernel Weight Comparison
Multiple kernel learning weighs the kernels differently for each dataset, automatically selecting a good model.
Kernels: Gaussian(width=1), Polynomial(degree=4), Sigmoid, Linear

          Gaussian(1)  Polynomial(4)  Sigmoid    Linear
Dataset1  9.99e-01     2.09e-08       1.02e-10   1.05e-10
Dataset2  2.77e-03     9.97e-01       8.75e-07   9.10e-07
Dataset3  8.77e-01     8.27e-05       1.95e-06   1.23e-01
Dataset4  8.08e-01     9.56e-05       1.31e-07   1.91e-01
Dataset5  9.85e-04     9.99e-01       1.40e-07   6.13e-07
Dataset6  7.18e-01     1.79e-01       6.36e-08   1.02e-01
Dataset7  9.99e-01     2.33e-06       1.79e-08   3.10e-08
Dataset8  8.44e-01     1.61e-07       6.63e-09   1.55e-01

Data Coming from Heterogeneous Sources. Dataset from linear and radial sources.

Data Coming from Heterogeneous Sources. Figure: One source is linear and the other is Gaussian

Regression on a sine wave with varying frequency

Classification on Real Datasets
Dataset 1: USPS handwritten digit data
Description: 4650 training and test examples with ten classes, each corresponding to one digit
Kernels used: Polynomial(degree=2), Gaussian(width=15)
Result:
MKL        94.9
Gaussian   93.1
Polynomial 93.5

Classification on Real Datasets
Dataset 2: Ionosphere dataset from the UCI repository
Description: 280 training and 71 test examples (10-fold)
Kernels used: Polynomial(degree=2), Gaussian(width=15)
Result:
MKL        94.1
Gaussian   92.1
Polynomial 91

MKL automatically learns an efficient weighted combination of kernels. It gives lower generalization error than any of the individual kernels, independent of the data distribution and separability. It also learns efficiently in the presence of outliers and noisy data, and can learn from data coming from different heterogeneous sources, i.e., multimodal sources.

Apply MKL to real multi-modal datasets such as video with audio and subtitles. Experiment with non-convex and non-linear combinations of kernels.

References
http://www.jmlr.org/papers/volume7/sonnenburg06a/sonnenburg06a.pdf
http://www.di.ens.fr/~fbach/skm_icml.pdf
http://www.shogun-toolbox.org/
http://www.shogun-toolbox.org/doc/en/3.0.0/index.html
http://www.gaussianprocess.org/gpml/data/
https://archive.ics.uci.edu/ml/datasets/ionosphere

Thank You!