Lecture 9: Multi Kernel SVM


Stéphane Canu, stephane.canu@litislab.eu. Sao Paulo 2014, April 6, 2014.

Roadmap
- Tuning the kernel: MKL
- The multiple kernel problem
- Sparse kernel machines for regression: SVR
- SimpleMKL: the multiple kernel solution

Standard Learning with Kernels: the user chooses a kernel k; the data and this kernel are fed to the learning machine, which returns f. http://www.cs.nyu.edu/~mohri/icml2011-tutorial/tutorial-icml2011-2.pdf

Learning Kernel framework: the user instead provides a family of kernels; from the data, the learning machine returns both f and the kernel k(., .). http://www.cs.nyu.edu/~mohri/icml2011-tutorial/tutorial-icml2011-2.pdf

From SVM to Multiple Kernel Learning (MKL)

SVM: single kernel k,
f(x) = \sum_{i=1}^n \alpha_i k(x, x_i) + b

MKL: set of M kernels k_1, ..., k_m, ..., k_M; learn the classifier and the combination weights together. This can be cast as a convex optimization problem:
f(x) = \sum_{i=1}^n \sum_{m=1}^M \alpha_i d_m k_m(x, x_i) + b, with \sum_{m=1}^M d_m = 1 and 0 \le d_m
     = \sum_{i=1}^n \alpha_i K(x, x_i) + b, with K(x, x_i) = \sum_{m=1}^M d_m k_m(x, x_i)

http://www.nowozin.net/sebastian/talks/iccv-2009-lpbeta.pdf

Multiple Kernel: the model

f(x) = \sum_{i=1}^n \sum_{m=1}^M \alpha_i d_m k_m(x, x_i) + b, with \sum_{m=1}^M d_m = 1 and 0 \le d_m

Given M kernel functions k_1, ..., k_M that are potentially well suited for a given problem, find a positive linear combination of these kernels such that the resulting kernel k is optimal:
k(x, x') = \sum_{m=1}^M d_m k_m(x, x'), with d_m \ge 0 and \sum_m d_m = 1

Learned together: the kernel coefficients d_m and the SVM parameters \alpha_i, b.
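As a small illustration (a sketch of ours, not the lecture's code, assuming Gaussian base kernels with different bandwidths), the combined Gram matrix K = \sum_m d_m K_m can be built as follows:

```python
# Sketch: a convex combination of base Gram matrices, assuming Gaussian
# base kernels k_m with different bandwidths (our choice for the example).
import numpy as np

def gaussian_gram(X1, X2, bandwidth):
    """Gram matrix K_ij = exp(-||x_i - x'_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def combined_gram(X1, X2, bandwidths, d):
    """K(x, x') = sum_m d_m k_m(x, x'), with d on the simplex."""
    assert np.all(d >= 0) and np.isclose(d.sum(), 1.0)
    return sum(dm * gaussian_gram(X1, X2, bw)
               for dm, bw in zip(d, bandwidths))
```

Since each k_m is positive definite and the weights d_m are nonnegative, the combination is itself a valid kernel.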

Multiple Kernel: illustration [figure].

Multiple Kernel Strategies
- Wrapper method (Weston et al., 2000; Chapelle et al., 2002): solve the SVM, then gradient descent on d_m with respect to a criterion (margin criterion, span criterion).
- Kernel Learning & Feature Selection: use the kernels as a dictionary.
- Embedded Multiple Kernel Learning (MKL).

Multiple Kernel functional Learning

The problem (for a given C):
\min_{f, b, \xi, d} \frac{1}{2} \|f\|_H^2 + C \sum_i \xi_i
with y_i (f(x_i) + b) \ge 1 - \xi_i; \xi_i \ge 0 \forall i; \sum_{m=1}^M d_m = 1, d_m \ge 0 \forall m,
where f = \sum_m f_m and k(x, x') = \sum_m d_m k_m(x, x'), with d_m \ge 0.

The functional framework: H = \bigoplus_{m=1}^M H_m with \langle f, g \rangle_H = \sum_m \frac{1}{d_m} \langle f_m, g_m \rangle_{H_m}

Multiple Kernel functional Learning

The problem (for a given C):
\min_{\{f_m\}, b, \xi, d} \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + C \sum_i \xi_i
with y_i (\sum_m f_m(x_i) + b) \ge 1 - \xi_i; \xi_i \ge 0 \forall i; \sum_m d_m = 1, d_m \ge 0 \forall m

Treated as a bi-level optimization task:
\min_{d \in \mathbb{R}^M} J(d) s.t. \sum_m d_m = 1, d_m \ge 0,
where J(d) = \min_{\{f_m\}, b, \xi} \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + C \sum_i \xi_i, s.t. y_i (\sum_m f_m(x_i) + b) \ge 1 - \xi_i, \xi_i \ge 0 \forall i

Multiple Kernel: representer theorem and dual

The Lagrangian:
L = \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + C \sum_i \xi_i - \sum_i \alpha_i (y_i (\sum_m f_m(x_i) + b) - 1 + \xi_i) - \sum_i \beta_i \xi_i

Associated KKT stationarity conditions:
\nabla_{f_m} L = 0 \iff \frac{1}{d_m} f_m(\cdot) = \sum_{i=1}^n \alpha_i y_i k_m(\cdot, x_i), m = 1, ..., M

Representer theorem:
f(\cdot) = \sum_m f_m(\cdot) = \sum_{i=1}^n \alpha_i y_i \sum_m d_m k_m(\cdot, x_i) = \sum_{i=1}^n \alpha_i y_i K(\cdot, x_i)

We have a standard SVM problem with respect to the function f and the kernel K.

Multiple Kernel Algorithm

Use a reduced gradient algorithm on
\min_{d \in \mathbb{R}^M} J(d) s.t. \sum_m d_m = 1, d_m \ge 0

SimpleMKL algorithm:
  set d_m = 1/M for m = 1, ..., M
  while stopping criterion not met do
    compute J(d) using a QP solver with K = \sum_m d_m K_m
    compute \partial J / \partial d_m, and the projected gradient as a descent direction D
    compute the optimal stepsize \gamma
    d \leftarrow d + \gamma D
  end while

Improvement reported using the Hessian (Rakotomamonjy et al., JMLR 08).
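A minimal sketch of this loop (ours, not the authors' SimpleMKL implementation): scikit-learn's SVC with a precomputed kernel plays the role of the generic QP solver, and a fixed stepsize with simplex projection replaces the exact line search of the real algorithm.

```python
# Sketch of the SimpleMKL loop under the assumptions stated above.
import numpy as np
from sklearn.svm import SVC

def svm_dual(K, y, C):
    """Solve the single-kernel SVM; return J(d) and the vector alpha_i * y_i."""
    svm = SVC(C=C, kernel="precomputed").fit(K, y)
    alpha_y = np.zeros(len(y))
    alpha_y[svm.support_] = svm.dual_coef_.ravel()      # alpha_i * y_i
    J = np.abs(alpha_y).sum() - 0.5 * alpha_y @ K @ alpha_y
    return J, alpha_y

def simple_mkl(Ks, y, C=1.0, n_iter=50, step=0.05):
    M = len(Ks)
    d = np.full(M, 1.0 / M)                             # d_m = 1/M
    for _ in range(n_iter):
        K = sum(dm * Km for dm, Km in zip(d, Ks))       # K = sum_m d_m K_m
        _, alpha_y = svm_dual(K, y, C)
        # dJ/dd_m = -0.5 * alpha' G_m alpha (labels folded into alpha_y)
        grad = np.array([-0.5 * alpha_y @ Km @ alpha_y for Km in Ks])
        D = -(grad - grad.mean())                       # keep sum_m D_m = 0
        d = np.clip(d + step * D, 0.0, None)            # enforce d_m >= 0
        d /= d.sum()                                    # enforce sum_m d_m = 1
    return d
```

On exit, d typically concentrates its mass on the few base kernels that actually help the classifier.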

Computing the reduced gradient

At the optimum, the primal cost equals the dual cost:
primal cost: \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + C \sum_i \xi_i
dual cost: -\frac{1}{2} \alpha^\top G \alpha + e^\top \alpha, with G = \sum_m d_m G_m where G_{m,ij} = k_m(x_i, x_j)

The dual cost is easier for the gradient:
\partial J / \partial d_m = -\frac{1}{2} \alpha^\top G_m \alpha

Reduce (or project) to satisfy the constraint \sum_m d_m = 1, i.e. \sum_m D_m = 0:
D_m = \partial J / \partial d_m - \partial J / \partial d_1 for m = 2, ..., M, and D_1 = -\sum_{m=2}^M D_m
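In code, the slide's reduction (eliminating d_1 through the equality constraint) reads as follows, as a sketch:

```python
import numpy as np

def reduced_direction(grad):
    """D_m = dJ/dd_m - dJ/dd_1 for m >= 2, and D_1 = -sum_{m>=2} D_m,
    so that sum_m D_m = 0 and the update stays on the simplex."""
    D = grad - grad[0]        # D_1 starts at 0
    D[0] = -D[1:].sum()
    return D
```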

Complexity

For each iteration:
- SVM training: O(n n_sv + n_sv^3). Inverting K_{sv,sv} is O(n_sv^3), but it might already be available as a by-product of the SVM training.
- Computing H: O(M n_sv^2)
- Finding d: O(M^3)

The number of iterations is usually less than 10. When M < n_sv, computing d is not more expensive than the QP.

MKL on the Caltech-101 dataset [figure]. http://www.robots.ox.ac.uk/~vgg/software/mkl/

Support vector regression (SVR): the t-insensitive loss
\min_{f \in H} \frac{1}{2} \|f\|_H^2 with |f(x_i) - y_i| \le t, i = 1, ..., n

Support vector regression introduces slack variables:
(SVR) \min_{f \in H} \frac{1}{2} \|f\|_H^2 + C \sum_i \xi_i with |f(x_i) - y_i| \le t + \xi_i, 0 \le \xi_i, i = 1, ..., n

This is a typical multi-parametric quadratic program (mpQP): the regularization path is piecewise linear,
\alpha(C, t) = \alpha(C_0, t_0) + (1/C - 1/C_0) u + 1/C_0 (t - t_0) v,
giving a 2d Pareto front (the tube width versus the regularity).
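As a usage sketch (ours, not part of the lecture), scikit-learn's SVR implements exactly this program, with the epsilon parameter playing the role of the tube width t:

```python
# Sketch: effect of C on the t-insensitive SVR fit (epsilon = t here).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 8.0, 80))[:, None]
y = np.sinc(x.ravel()) + 0.1 * rng.standard_normal(80)

# C large: the tube follows the data closely; C small: a smoother fit.
for C in (100.0, 0.1):
    model = SVR(kernel="rbf", C=C, epsilon=0.1).fit(x, y)
    print(f"C={C}: {model.support_.size} support vectors")
```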

Support vector regression illustration [figure: two Support Vector Machine Regression fits of the same data, one with C large and one with C small]. There exist other formulations, such as LP SVR...

Multiple Kernel Learning for regression

The problem (for given C and t):
\min_{\{f_m\}, b, \xi, d} \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + C \sum_i \xi_i
s.t. |\sum_m f_m(x_i) + b - y_i| \le t + \xi_i, \xi_i \ge 0 \forall i; \sum_m d_m = 1, d_m \ge 0 \forall m

Regularization formulation:
\min_{\{f_m\}, b, d} \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + C \sum_i \max(|\sum_m f_m(x_i) + b - y_i| - t, 0), s.t. \sum_m d_m = 1, d_m \ge 0 \forall m

Equivalently:
\min_{\{f_m\}, b, d} \sum_i \max(|\sum_m f_m(x_i) + b - y_i| - t, 0) + \frac{1}{2C} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + \mu \sum_m d_m

Multiple Kernel functional Learning (regression)

The problem (for given C and t):
\min_{\{f_m\}, b, \xi, d} \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + C \sum_i \xi_i
s.t. |\sum_m f_m(x_i) + b - y_i| \le t + \xi_i, \xi_i \ge 0 \forall i; \sum_m d_m = 1, d_m \ge 0 \forall m

Treated as a bi-level optimization task:
\min_{d \in \mathbb{R}^M} J(d) s.t. \sum_m d_m = 1, d_m \ge 0,
where J(d) = \min_{\{f_m\}, b, \xi} \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H_m}^2 + C \sum_i \xi_i s.t. |\sum_m f_m(x_i) + b - y_i| \le t + \xi_i, \xi_i \ge 0 \forall i
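The same reduced-gradient scheme applies; only the inner solver and the dual objective change. A sketch (ours) of the regression counterpart of the inner problem, using scikit-learn's SVR with a precomputed kernel and writing beta_i = alpha_i - alpha_i^* for the epsilon-SVR dual variables:

```python
# Sketch: inner problem for MKL regression. The gradient w.r.t. each d_m
# is again -0.5 * beta' K_m beta, so the simple_mkl loop above carries
# over once svm_dual is swapped for this function.
import numpy as np
from sklearn.svm import SVR

def svr_dual(K, y, C, t):
    """Solve the single-kernel SVR; return J(d) and beta = alpha - alpha*."""
    svr = SVR(kernel="precomputed", C=C, epsilon=t).fit(K, y)
    beta = np.zeros(len(y))
    beta[svr.support_] = svr.dual_coef_.ravel()
    J = beta @ y - t * np.abs(beta).sum() - 0.5 * beta @ K @ beta
    return J, beta
```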

Multiple Kernel experiments

[Figure: the four test signals (LinChirp, Wave, Blocks, Spikes) and their estimates.]

            Single Kernel    Kernel Dil               Kernel Dil-Trans
Data Set    Norm. MSE (%)    #Kernel  Norm. MSE       #Kernel  Norm. MSE
LinChirp    1.46 ± 0.28      7.0      1.00 ± 0.15     21.5     0.92 ± 0.20
Wave        0.98 ± 0.06      5.5      0.73 ± 0.10     20.6     0.79 ± 0.07
Blocks      1.96 ± 0.14      6.0      2.1 ± 0.2       19.4     1.94 ± 0.13
Spike       6.85 ± 0.68      6.1      6.97 ± 0.84     12.8     5.58 ± 0.84

Table: Normalized mean square error averaged over 20 runs.

Conclusion on multiple kernel learning (MKL)
- MKL: kernel tuning, variable selection... extension to classification and one-class SVM
- SVM-KM: an efficient Matlab toolbox (available on MLOSS): http://mloss.org/software/view/33/
- Multiple Kernels for Image Classification: Software and Experiments on Caltech-101: http://www.robots.ox.ac.uk/~vgg/software/mkl/
- New trend: multi kernel, multi task, and the number of kernels

Bibliography

A. Rakotomamonjy, F. Bach, S. Canu & Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 2008, 9:2491-2521.
M. Gönen & E. Alpaydin. Multiple kernel learning algorithms. Journal of Machine Learning Research, 2011, 12:2211-2268.
http://www.cs.nyu.edu/~mohri/icml2011-tutorial/tutorial-icml2011-2.pdf
http://www.robots.ox.ac.uk/~vgg/software/mkl/
http://www.nowozin.net/sebastian/talks/iccv-2009-lpbeta.pdf