Lecture 9: Multi Kernel SVM
Stéphane Canu
stephane.canu@litislab.eu
Sao Paulo 2014, April 6, 2014
Roadmap
- Tuning the kernel: MKL
- The multiple kernel problem
- Sparse kernel machines for regression: SVR
- SimpleMKL: the multiple kernel solution
Standard Learning with Kernels
The user supplies a kernel k; the learning machine takes the data and returns a function f.
http://www.cs.nyu.edu/~mohri/icml2011-tutorial/tutorial-icml2011-2.pdf

Learning Kernel Framework
The user supplies a kernel family; the learning machine takes the data and returns both f and the kernel k(., .).
http://www.cs.nyu.edu/~mohri/icml2011-tutorial/tutorial-icml2011-2.pdf
From SVM
SVM: single kernel k
$$f(x) = \sum_{i=1}^n \alpha_i \, k(x, x_i) + b$$
http://www.nowozin.net/sebastian/talks/iccv-2009-lpbeta.pdf

From SVM to Multiple Kernel Learning (MKL)
SVM: single kernel k
MKL: set of M kernels $k_1, \dots, k_m, \dots, k_M$
- learn the classifier and the combination weights
- can be cast as a convex optimization problem
$$f(x) = \sum_{i=1}^n \sum_{m=1}^M \alpha_i \, d_m k_m(x, x_i) + b, \qquad \sum_{m=1}^M d_m = 1 \ \text{ and } \ 0 \le d_m$$
$$f(x) = \sum_{i=1}^n \alpha_i \, K(x, x_i) + b \quad \text{with} \quad K(x, x_i) = \sum_{m=1}^M d_m k_m(x, x_i)$$
http://www.nowozin.net/sebastian/talks/iccv-2009-lpbeta.pdf
Multiple Kernel
The model
$$f(x) = \sum_{i=1}^n \sum_{m=1}^M \alpha_i \, d_m k_m(x, x_i) + b, \qquad \sum_{m=1}^M d_m = 1 \ \text{ and } \ 0 \le d_m$$
Given M kernel functions $k_1, \dots, k_M$ that are potentially well suited for a given problem, find a positive linear combination of these kernels such that the resulting kernel k is optimal:
$$k(x, x') = \sum_{m=1}^M d_m k_m(x, x'), \quad \text{with } d_m \ge 0, \ \sum_m d_m = 1$$
Learning together: the kernel coefficients $d_m$ and the SVM parameters $\alpha_i, b$.
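The combined kernel is cheap to form once the individual Gram matrices are precomputed. A minimal NumPy sketch (the function name and interface are illustrative, not from the lecture):

```python
import numpy as np

def combined_gram(gram_list, d):
    """Convex combination K = sum_m d_m K_m of precomputed Gram matrices.

    gram_list : list of M (n, n) kernel matrices K_m (each PSD)
    d         : (M,) weights on the simplex (d_m >= 0, sum_m d_m = 1)
    """
    d = np.asarray(d, dtype=float)
    if np.any(d < 0) or not np.isclose(d.sum(), 1.0):
        raise ValueError("d must lie on the probability simplex")
    # A nonnegative combination of PSD matrices is PSD, so K is a valid kernel.
    return sum(dm * Km for dm, Km in zip(d, gram_list))
```

Any single-kernel SVM solver that accepts a precomputed kernel can then be trained on K; MKL's job is choosing d.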
Multiple Kernel: illustration [figure]
Multiple Kernel Strategies
- Wrapper method (Weston et al., 2000; Chapelle et al., 2002): alternate between solving the SVM and a gradient descent step on d, guided by a criterion: the margin criterion or the span criterion
- Kernel Learning & Feature Selection: use the kernels as a dictionary
- Embedded Multiple Kernel Learning (MKL)
Multiple Kernel Functional Learning
The problem (for a given C):
$$\min_{f \in \mathcal{H}, b, \xi, d} \ \frac{1}{2}\|f\|^2_{\mathcal{H}} + C \sum_i \xi_i \quad \text{with } y_i\big(f(x_i) + b\big) \ge 1 - \xi_i, \ \xi_i \ge 0 \ \forall i, \quad \sum_{m=1}^M d_m = 1, \ d_m \ge 0$$
where $f = \sum_{m=1}^M f_m$ and $k(x, x') = \sum_m d_m k_m(x, x')$ with $d_m \ge 0$.
The functional framework:
$$\mathcal{H} = \bigoplus_{m=1}^M \mathcal{H}_m, \qquad \langle f, g \rangle_{\mathcal{H}} = \sum_m \frac{1}{d_m} \langle f_m, g_m \rangle_{\mathcal{H}_m}$$
Multiple Kernel Functional Learning
The problem (for a given C):
$$\min_{\{f_m\}, b, \xi, d} \ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + C \sum_i \xi_i \quad \text{with } y_i\Big(\sum_m f_m(x_i) + b\Big) \ge 1 - \xi_i, \ \xi_i \ge 0 \ \forall i, \quad \sum_m d_m = 1, \ d_m \ge 0$$
Treated as a bi-level optimization task:
$$\min_{d \in \mathbb{R}^M} \ \Big\{ \min_{\{f_m\}, b, \xi} \ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + C \sum_i \xi_i \ \text{ with } y_i\Big(\sum_m f_m(x_i) + b\Big) \ge 1 - \xi_i, \ \xi_i \ge 0 \ \forall i \Big\} \quad \text{s.t. } \sum_m d_m = 1, \ d_m \ge 0$$
Multiple Kernel: representer theorem and dual
The Lagrangian:
$$L = \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + C \sum_i \xi_i - \sum_i \alpha_i \Big( y_i\Big(\sum_m f_m(x_i) + b\Big) - 1 + \xi_i \Big) - \sum_i \beta_i \xi_i$$
Associated KKT stationarity conditions:
$$\nabla_{f_m} L = 0 \iff \frac{1}{d_m} f_m(\cdot) = \sum_{i=1}^n \alpha_i y_i \, k_m(\cdot, x_i), \quad m = 1, \dots, M$$
Representer theorem:
$$f(\cdot) = \sum_m f_m(\cdot) = \sum_{i=1}^n \alpha_i y_i \underbrace{\sum_m d_m k_m(\cdot, x_i)}_{K(\cdot, x_i)}$$
We have a standard SVM problem with respect to the function f and the kernel K.
Multiple Kernel Algorithm
Use a reduced gradient algorithm:
$$\min_{d \in \mathbb{R}^M} \ J(d) \quad \text{s.t. } \sum_m d_m = 1, \ d_m \ge 0$$
SimpleMKL algorithm:
  set $d_m = \frac{1}{M}$ for $m = 1, \dots, M$
  while stopping criterion not met do
    compute $J(d)$ using a QP solver with $K = \sum_m d_m K_m$
    compute $\frac{\partial J}{\partial d_m}$, and the projected gradient as a descent direction $D$
    compute the optimal stepsize $\gamma$
    $d \leftarrow d + \gamma D$
  end while
Improvement reported using the Hessian (Rakotomamonjy et al., JMLR 2008). A sketch of this loop follows.
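A compact illustration of the loop, assuming scikit-learn's SVC as the inner QP solver and a crude zero-sum projected-gradient update in place of SimpleMKL's exact reduced gradient and line search (detailed on the next slide); all helper names are mine, not the lecture's:

```python
import numpy as np
from sklearn.svm import SVC  # plays the role of the inner QP solver

def simple_mkl_sketch(gram_list, y, C=1.0, n_iter=20, gamma=1e-2):
    """Illustrative SimpleMKL-style outer loop (not the reference implementation).

    gram_list : list of M precomputed (n, n) kernel matrices K_m
    y         : (n,) labels in {-1, +1}
    """
    M, n = len(gram_list), len(y)
    d = np.full(M, 1.0 / M)                       # initialization: d_m = 1/M
    for _ in range(n_iter):                       # stopping criterion: fixed budget here
        K = sum(dm * Km for dm, Km in zip(d, gram_list))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        a = np.zeros(n)
        a[svm.support_] = svm.dual_coef_.ravel()  # a_i = alpha_i * y_i
        grad = np.array([-0.5 * a @ Km @ a for Km in gram_list])  # dJ/dd_m, next slide
        D = -(grad - grad.mean())                 # zero-sum direction keeps sum(d) = 1
        d = np.clip(d + gamma * D, 0.0, None)     # fixed step + clipping stands in for
        d /= d.sum()                              # the exact projection and line search
    return d
```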
Computing the reduced gradient
At the optimum, the primal cost equals the dual cost:
$$\underbrace{\frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + C \sum_i \xi_i}_{\text{primal cost}} = \underbrace{-\frac{1}{2}\,\alpha^\top G \alpha + e^\top \alpha}_{\text{dual cost}} \quad \text{with } G = \sum_m d_m G_m, \ \ G_{m,ij} = k_m(x_i, x_j)$$
The dual cost is easier to differentiate:
$$\frac{\partial J}{\partial d_m}(d) = -\frac{1}{2}\,\alpha^\top G_m \alpha$$
Reduce (or project) to satisfy the constraints $\sum_m d_m = 1$, $d_m \ge 0$:
$$D_m = 0 \ \text{ if } d_m = 0, \qquad D_m = \frac{\partial J}{\partial d_1}(d) - \frac{\partial J}{\partial d_m}(d) \ \text{ otherwise}, \qquad D_1 = -\sum_{m=2}^M D_m$$
The reduction step is written out as code after this slide.
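A sketch of the reduction under the slide's notation; the SimpleMKL code eliminates the largest $d_m$ rather than $d_1$, which is numerically safer and is what is done here:

```python
import numpy as np

def reduced_direction(grad, d, eps=1e-12):
    """Descent direction from the reduced gradient on {d >= 0, sum_m d_m = 1}.

    grad : (M,) partial derivatives dJ/dd_m at the current d
    d    : (M,) current weights
    """
    mu = int(np.argmax(d))               # reference coordinate (d_1 on the slide)
    D = grad[mu] - grad                  # D_m = dJ/dd_mu - dJ/dd_m
    D[(d <= eps) & (D < 0.0)] = 0.0      # D_m = 0 where it would push d_m = 0 negative
    D[mu] = 0.0
    D[mu] = -D.sum()                     # D_mu = -sum_{m != mu} D_m keeps sum(d) = 1
    return D
```

Moving along D decreases J to first order: grad · D = -sum over the free coordinates of (dJ/dd_mu - dJ/dd_m)^2, which is nonpositive.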
Complexity
For each iteration:
- SVM training: $O(n\,n_{sv} + n_{sv}^3)$. Inverting $K_{sv,sv}$ is $O(n_{sv}^3)$, but it might already be available as a by-product of the SVM training.
- Computing H: $O(M n_{sv}^2)$
- Finding d: $O(M^3)$
The number of iterations is usually less than 10. When $M < n_{sv}$, computing d is not more expensive than the QP.
MKL on the caltech-101 dataset
http://www.robots.ox.ac.uk/~vgg/software/mkl/
Support Vector Regression (SVR)
The t-insensitive loss:
$$\min_{f \in \mathcal{H}} \ \frac{1}{2}\|f\|^2_{\mathcal{H}} \quad \text{with } |f(x_i) - y_i| \le t, \ i = 1, \dots, n$$
Support vector regression introduces slack variables:
$$\text{(SVR)} \quad \min_{f \in \mathcal{H}} \ \frac{1}{2}\|f\|^2_{\mathcal{H}} + C \sum_i \xi_i \quad \text{with } |f(x_i) - y_i| \le t + \xi_i, \ 0 \le \xi_i, \ i = 1, \dots, n$$
- a typical multi-parametric quadratic program (mpQP)
- piecewise linear regularization path: $\alpha(C, t) = \alpha(C_0, t_0) + (C - C_0)\,u + C_0\,(t - t_0)\,v$
- a 2d Pareto front (the tube width and the regularity)
A small numerical illustration follows.
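For a concrete feel of the two knobs, a small scikit-learn run on synthetic data (scikit-learn calls the tube width `epsilon`, the slide's t):

```python
import numpy as np
from sklearn.svm import SVR  # epsilon-SVR; `epsilon` is the slide's tube width t

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 8.0, 60)).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(60)

# Large C: the estimate tracks the data tightly; small C: a smoother, flatter
# estimate, as in the two panels of the illustration below.
for C in (100.0, 0.1):
    model = SVR(kernel="rbf", C=C, epsilon=0.1).fit(x, y)
    print(f"C={C:g}: {model.support_.size} support vectors")
```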
Support vector regression illustration
[Figure: two SVR fits, C large (tight fit) vs. C small (smoother fit)]
There exist other formulations, such as LP-SVR...
Multiple Kernel Learning for regression
The problem (for given C and t):
$$\min_{\{f_m\}, b, \xi, d} \ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + C \sum_i \xi_i \quad \text{s.t. } \Big|\sum_m f_m(x_i) + b - y_i\Big| \le t + \xi_i, \ \xi_i \ge 0 \ \forall i, \quad \sum_m d_m = 1, \ d_m \ge 0$$
Regularization formulation:
$$\min_{\{f_m\}, b, d} \ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + C \sum_i \max\Big(\Big|\sum_m f_m(x_i) + b - y_i\Big| - t,\ 0\Big), \quad \sum_m d_m = 1, \ d_m \ge 0$$
Equivalently:
$$\min_{\{f_m\}, b, d} \ \sum_i \max\Big(\Big|\sum_m f_m(x_i) + b - y_i\Big| - t,\ 0\Big) + \frac{1}{2C}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + \mu \sum_m d_m$$
Multiple Kernel Functional Learning
The problem (for given C and t):
$$\min_{\{f_m\}, b, \xi, d} \ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + C \sum_i \xi_i \quad \text{s.t. } \Big|\sum_m f_m(x_i) + b - y_i\Big| \le t + \xi_i, \ \xi_i \ge 0 \ \forall i, \quad \sum_m d_m = 1, \ d_m \ge 0$$
Treated as a bi-level optimization task:
$$\min_{d \in \mathbb{R}^M} \ \Big\{ \min_{\{f_m\}, b, \xi} \ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|^2_{\mathcal{H}_m} + C \sum_i \xi_i \ \text{ s.t. } \Big|\sum_m f_m(x_i) + b - y_i\Big| \le t + \xi_i, \ \xi_i \ge 0 \ \forall i \Big\} \quad \text{s.t. } \sum_m d_m = 1, \ d_m \ge 0$$
The inner problem is an ordinary SVR on the combined kernel, as sketched below.
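So the classification machinery carries over by swapping the inner solver: solve an SVR on $K = \sum_m d_m K_m$, then reuse the same gradient and reduced-gradient update on d. A sketch with scikit-learn's SVR (the helper name is mine):

```python
import numpy as np
from sklearn.svm import SVR  # inner solver for the regression case

def inner_svr_gradient(gram_list, y, d, C=1.0, t=0.1):
    """Solve the inner SVR on K = sum_m d_m K_m and return dJ/dd_m.

    The outer loop over d (reduced gradient on the simplex) is unchanged
    from the classification case.
    """
    K = sum(dm * Km for dm, Km in zip(d, gram_list))
    svr = SVR(kernel="precomputed", C=C, epsilon=t).fit(K, y)
    a = np.zeros(len(y))
    a[svr.support_] = svr.dual_coef_.ravel()  # a_i = alpha_i - alpha_i^*
    # Same formula as in classification: dJ/dd_m = -1/2 a' K_m a
    return np.array([-0.5 * a @ Km @ a for Km in gram_list])
```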
Multiple Kernel experiments
[Figure: the four test signals (LinChirp, Wave, Blocks, Spikes) and their estimates]

Data Set    Single Kernel     Kernel Dil               Kernel Dil-Trans
            Norm. MSE (%)     #Kernel   Norm. MSE      #Kernel   Norm. MSE
LinChirp    1.46 ± 0.28       7.0       1.00 ± 0.15    21.5      0.92 ± 0.20
Wave        0.98 ± 0.06       5.5       0.73 ± 0.10    20.6      0.79 ± 0.07
Blocks      1.96 ± 0.14       6.0       2.11 ± 0.12    19.4      1.94 ± 0.13
Spike       6.85 ± 0.68       6.1       6.97 ± 0.84    21.8      5.58 ± 0.84
Table: Normalized Mean Square error averaged over 20 runs.
Conclusion on multiple kernel (MKL)
- MKL: kernel tuning, variable selection... extension to classification and one-class SVM
- SVM-KM: an efficient Matlab toolbox (available on MLOSS)²
- Multiple Kernels for Image Classification: Software and Experiments on Caltech-101³
- New trend: multiple kernels, multiple tasks and the number of kernels

² http://mloss.org/software/view/33/
³ http://www.robots.ox.ac.uk/~vgg/software/mkl/
Bibliography
A. Rakotomamonjy, F. Bach, S. Canu & Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res. 2008; 9:2491-2521.
M. Gönen & E. Alpaydin. Multiple kernel learning algorithms. J. Mach. Learn. Res. 2011; 12:2211-2268.
http://www.cs.nyu.edu/~mohri/icml2011-tutorial/tutorial-icml2011-2.pdf
http://www.robots.ox.ac.uk/~vgg/software/mkl/
http://www.nowozin.net/sebastian/talks/iccv-2009-lpbeta.pdf