Mathematical Programming for Multiple Kernel Learning
Alex Zien
Fraunhofer FIRST.IDA, Berlin, Germany
Friedrich Miescher Laboratory, Tübingen, Germany
7 July 2009
Mathematical Programming Stream, EURO 2009, Bonn, Germany
The Goal: Multiple Kernel Learning

Given
- N data items x_n \in X (n = 1, \dots, N),
- associated labels y_n \in \{-1, +1\} (n = 1, \dots, N),
- P feature maps \Phi_p : X \to R^{d_p} (p = 1, \dots, P), and
- a regularization constant 0 < C \in R,

efficiently solve the Multiple Kernel Learning (MKL) problem

\min_{w,b,\xi,\beta} \ \frac{1}{2} \sum_{p=1}^{P} \beta_p \|w_p\|^2 + C \langle \mathbf{1}, \xi \rangle
\text{s.t.} \quad \forall n: \ y_n \Big( \sum_{p=1}^{P} \beta_p \langle w_p, \Phi_p(x_n) \rangle + b \Big) \geq 1 - \xi_n,
\qquad 0 \leq \xi \in R^N, \quad 0 \leq \beta \in R^P, \quad \langle \mathbf{1}, \beta \rangle \leq 1.
Application: Subcellular Localization of Proteins
Input: protein sequence
Output: target location of the protein
[image from "A primer on molecular biology", in Kernel Methods in Computational Biology, MIT Press, 2004]
Outline
1 The Goal: MKL for Classification
2 Why Optimize This Problem?
  - Support Vector Machines (SVMs)
  - Non-Linearity via Kernels
  - A Large Margin MKL Model
3 Optimization for MKL
4 Application: Predicting Protein Subcellular Localization
5 Take Home Messages
Support Vector Machine (SVM): Maximum Margin

SVM: maximum margin classifier
[figure: maximum-margin hyperplane \langle w, x \rangle + b = 0 with margin boundaries \langle w, x \rangle + b = \pm 1]

\min_{w,b} \ \underbrace{\frac{1}{2} \langle w, w \rangle}_{\text{regularizer}} \quad \text{s.t.} \quad \forall i: \ \underbrace{y_i (\langle w, x_i \rangle + b) \geq 1}_{\text{data fitting}}
Support Vector Machine: Dual Is a QP

Primal:
\min_{w,b,\xi} \ \frac{1}{2} \langle w, w \rangle + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0

at optimum: \xi_i = \ell(x_i, y_i) := \max \{ 1 - y_i (\langle w, x_i \rangle + b), \ 0 \}

Dual:
\min_\alpha \ \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i \quad \text{s.t.} \quad 0 \leq \alpha \leq C \mathbf{1}, \quad \langle y, \alpha \rangle = 0

only needs pairwise dot products!
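To make the dual concrete, here is a minimal sketch of solving it with cvxopt, a generic off-the-shelf QP solver (the toy data X, y and the threshold 1e-6 are made-up placeholders, not from the talk):

```python
# Solve the SVM dual QP: min_a 1/2 a'Pa + q'a  s.t.  Ga <= h, Aa = b.
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

X = np.array([[0., 0.], [1., 1.], [2., 0.], [3., 3.]])   # toy inputs
y = np.array([-1., -1., 1., 1.])                          # toy labels
N, C = len(y), 1.0

K = X @ X.T                                  # Gram matrix of pairwise dot products
P = matrix(np.outer(y, y) * K)               # quadratic term: y_i y_j <x_i, x_j>
q = matrix(-np.ones(N))                      # linear term: -sum_i alpha_i
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))           # box: 0 <= alpha <= C
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
A = matrix(y.reshape(1, -1))                 # equality: <y, alpha> = 0
b = matrix(0.)

alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
print("support vectors:", np.where(alpha > 1e-6)[0])      # sparse solution
```

Note that only the Gram matrix K enters the solver, which is exactly what makes the kernel trick on the following slides possible.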
Non-Linear Mappings

Example: All Degree 2 Monomials for a 2D Input

\Phi : R^2 \to R^3 =: H \ (\text{the "feature space"})
(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2)

[figure: data in input space (x_1, x_2) vs. feature space (z_1, z_2, z_3)]
Kernel Trick

Example: All Degree 2 Monomials for a 2D Input

\langle \Phi(x), \Phi(x') \rangle = \langle (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2), \ (x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2) \rangle
= x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2
= (x_1 x_1' + x_2 x_2')^2
= \langle x, x' \rangle^2 =: k(x, x')

the dot product in H can be computed in R^2
[more information: Learning with Kernels; Schölkopf, Smola; MIT Press]
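As a quick sanity check of this identity, the explicit feature map and the squared dot product can be compared numerically (toy values of our choosing):

```python
import numpy as np

def phi(x):
    # explicit degree-2 monomial map into H = R^3
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, xp):
    # the same dot product, computed directly in R^2
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), k(x, xp))  # both print 1.0
```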
Relevant SVM Facts
- dual is a QP
- support vectors: sparse solution → efficient optimization
- non-linear via kernels

So how to choose the best kernel function?
A Multiple Kernel Learning (MKL) Model

MKL model: weighted linear mixture of P feature spaces
H_\gamma := \gamma_1 H_1 \times \gamma_2 H_2 \times \dots \times \gamma_P H_P

\Phi_\gamma(x) := \big( \gamma_p \Phi_p(x) \big)_{p=1,\dots,P} \qquad w_\gamma := \big( \gamma_p w_p \big)_{p=1,\dots,P}

k_\gamma(x, x') := \sum_{p=1}^{P} \langle \gamma_p \Phi_p(x), \gamma_p \Phi_p(x') \rangle = \sum_{p=1}^{P} \gamma_p^2 k_p(x, x')

f_\gamma(x) := \langle w_\gamma, \Phi_\gamma(x) \rangle = \sum_{p=1}^{P} \gamma_p^2 \langle w_p, \Phi_p(x) \rangle

Goal: learn the mixing coefficients \gamma = (\gamma_p)_{p=1,\dots,P} along with w, b
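Computationally, the mixed kernel is just a weighted sum of precomputed kernel matrices; a minimal sketch (the function name and all toy values are ours):

```python
import numpy as np

def combined_kernel(Ks, gamma):
    """Ks: array of shape (P, N, N) of base kernel matrices K_p.
    gamma: array of shape (P,) of mixing coefficients.
    Returns the N x N matrix of k_gamma = sum_p gamma_p^2 k_p."""
    return np.einsum('p,pij->ij', gamma ** 2, Ks)

P, N = 3, 5
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(N, 4)) for _ in range(P)]   # one feature set per kernel
Ks = np.stack([X @ X.T for X in Xs])               # linear base kernels
gamma = np.array([0.5, 0.3, 0.2])
K = combined_kernel(Ks, gamma)
```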
Large Margin MKL

Plugging it into the SVM

\min_{w,b,\xi,\gamma} \ \frac{1}{2} \|w_\gamma\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i: \ \xi_i = \ell \big( \langle w_\gamma, \Phi_\gamma(x_i) \rangle + b, \ y_i \big)

yields:

\min_{w,b,\xi,\gamma} \ \frac{1}{2} \sum_{p=1}^{P} \gamma_p^2 \|w_p\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i: \ \xi_i = \ell \Big( \sum_{p=1}^{P} \gamma_p^2 \langle w_p, \Phi_p(x_i) \rangle + b, \ y_i \Big)

for convenience we substitute \beta_p := \gamma_p^2
MKL: Retain Convexity and Feasibility

\min_{\beta,w,b,\xi} \ \frac{1}{2} \sum_{p=1}^{P} \beta_p \|w_p\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i: \ \xi_i = \ell \Big( \sum_{p=1}^{P} \beta_p \langle w_p, \Phi_p(x_i) \rangle + b, \ y_i \Big)

Problems: (i) the products \beta_p \|w_p\|^2 are non-convex; (ii) \beta \to \infty: scaling \beta_p up while scaling w_p down keeps the predictions fixed but drives the regularizer to zero.

Solutions: (i) change of variables v_p := \beta_p w_p; (ii) constrain \|\beta\|_1 \leq 1

\min_{\beta,v,b,\xi} \ \frac{1}{2} \sum_{p=1}^{P} \frac{1}{\beta_p} \|v_p\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i: \ \xi_i = \ell \Big( \sum_{p=1}^{P} \langle v_p, \Phi_p(x_i) \rangle + b, \ y_i \Big), \quad \|\beta\|_1 \leq 1

[Zien, Ong; ICML 2007] [Rakotomamonjy, Bach et al.; ICML 2007]
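The step the slide leaves implicit is why the substitution restores convexity; a worked version of that step, using only the slide's own definitions:

```latex
% Substituting v_p := \beta_p w_p into objective and constraints:
\beta_p \|w_p\|^2
  = \beta_p \left\| \frac{v_p}{\beta_p} \right\|^2
  = \frac{\|v_p\|^2}{\beta_p},
\qquad
\beta_p \langle w_p, \Phi_p(x_i) \rangle = \langle v_p, \Phi_p(x_i) \rangle .
% The quadratic-over-linear function (v, \beta) \mapsto \|v\|^2 / \beta
% is jointly convex for \beta > 0, so the reformulated problem is convex.
```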
MKL Lagrange Dual

For hinge loss, \ell(t, y) := \max \{0, 1 - yt\}:

\min_{\alpha,\gamma} \ \gamma - \sum_i \alpha_i
\text{s.t.} \quad \forall i: \ 0 \leq \alpha_i \leq C, \qquad \sum_i y_i \alpha_i = 0,
\qquad \forall p: \ \gamma \geq \frac{1}{2} \sum_i \sum_j y_i y_j \alpha_i \alpha_j \langle \Phi_p(x_i), \Phi_p(x_j) \rangle

- quadratically constrained quadratic program (QCQP)
- for a single kernel, reduces to the standard SVM
- may be solved by off-the-shelf solvers (e.g. CPLEX)

[Zien and Ong; ICML 2007]
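A sketch of this QCQP in CVXPY, a generic convex solver standing in for CPLEX; the kernels Ks, labels y, and C are toy placeholders of ours:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, P, C = 20, 3, 1.0
y = np.sign(rng.normal(size=N))
Ks = []
for _ in range(P):
    X = rng.normal(size=(N, 5))
    Ks.append(X @ X.T + 1e-8 * np.eye(N))   # PSD base kernels (tiny ridge for the PSD check)

alpha = cp.Variable(N)
gamma = cp.Variable()
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
for K in Ks:
    # one quadratic constraint per kernel p: gamma >= 1/2 (y*alpha)' K_p (y*alpha)
    constraints.append(gamma >= 0.5 * cp.quad_form(cp.multiply(y, alpha), K))

prob = cp.Problem(cp.Minimize(gamma - cp.sum(alpha)), constraints)
prob.solve()
print("dual objective:", prob.value)
```

With P = 1 this collapses to the single quadratic objective of the standard SVM dual, as the slide notes.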
MKL as SILP

Semi-Infinite Linear Program (SILP):

\max_{\beta,\theta} \ \theta
\text{s.t.} \quad \sum_p \beta_p = 1, \quad \beta_p \geq 0,
\qquad \forall \alpha \in S: \ \theta \leq \frac{1}{2} \sum_p \beta_p \|w_p(\alpha)\|^2 - \sum_i \alpha_i

with w_p(\alpha) := \sum_i y_i \alpha_i \Phi_p(x_i)

Infinite set of constraints indexed by:
S := \{ \alpha \mid \forall i: \ 0 \leq \alpha_i \leq C, \ \sum_i y_i \alpha_i = 0 \}

[Sonnenburg, Rätsch, Schäfer; NIPS 2006]
MKL Wrapper by Column Generation (1)

1 initialize LP with minimal set of constraints: \sum_p \beta_p = 1, \beta_p \geq 0
2 initialize \beta to a feasible value (e.g. \beta_p = 1/P)
3 iterate:
  - for given \beta, find the most violated constraint:
    \min_\alpha \ \frac{1}{2} \sum_p \beta_p \|w_p(\alpha)\|^2 - \sum_i \alpha_i \quad \text{s.t.} \quad \alpha \in S
    → solve a single-kernel SVM!
  - add this constraint to the LP
  - solve the LP to obtain new mixing coefficients \beta

just need a wrapper around a single-kernel method
MKL Wrapper by Column Generation (2)
Alternate between solving an LP for β and a QP for α.
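A compact sketch of this alternation, assuming precomputed base kernel matrices: scikit-learn's SVC with a precomputed kernel stands in for the single-kernel SVM, and scipy's linprog for the LP over (β, θ). The function name mkl_silp and all toy choices are ours, not from the talk or from Shogun:

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC

def mkl_silp(Ks, y, C=1.0, iters=20):
    """Ks: array (P, N, N) of base kernel matrices; y: labels in {-1, +1}."""
    P, N, _ = Ks.shape
    beta = np.ones(P) / P                      # feasible start: uniform weights
    cuts = []                                  # one (S_p(alpha), sum_i alpha_i) per constraint
    for _ in range(iters):
        K = np.einsum('p,pij->ij', beta, Ks)   # mixed kernel sum_p beta_p K_p
        svm = SVC(C=C, kernel='precomputed').fit(K, y)
        a = np.zeros(N)
        a[svm.support_] = svm.dual_coef_[0]    # a_i = y_i alpha_i
        Sp = 0.5 * np.array([a @ Kp @ a for Kp in Ks])   # 1/2 ||w_p(alpha)||^2
        cuts.append((Sp, np.abs(a).sum()))     # alpha_i >= 0, so sum alpha_i = sum |a_i|
        # LP in variables (beta_1..beta_P, theta): maximize theta subject to
        # theta <= sum_p beta_p Sp_t - r_t for every cut t, and sum_p beta_p = 1.
        c = np.zeros(P + 1); c[-1] = -1.0      # minimize -theta
        A_ub = np.array([np.append(-Sp_t, 1.0) for Sp_t, _ in cuts])
        b_ub = np.array([-r_t for _, r_t in cuts])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      A_eq=np.append(np.ones(P), 0.0).reshape(1, -1), b_eq=[1.0],
                      bounds=[(0, 1)] * P + [(None, None)])
        beta = res.x[:P]
    return beta
```

A real implementation would stop once the newly generated constraint is (almost) satisfied by the current (β, θ), i.e. once the SILP gap falls below a tolerance, rather than running a fixed number of iterations.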
Efficiency Comparison

The SILP-based wrapper is much more efficient than the QCQP.
Reason: it exploits the efficiency of existing SVM solvers.
The same holds for the projected-gradient approach. [Rakotomamonjy, Bach et al.; ICML 2007]

Idea to gain still more efficiency:
- each constraint (gradient) requires a full SVM training
- although most constraints are auxiliary anyway, i.e., they are based on suboptimal β
- could use approximate constraints, i.e., based on approximately trained SVMs
Chunking Approach

Idea: interleave the β-update with SVM chunking

SVM chunking: repeatedly optimize a subproblem
- select a subset of the α variables (a "chunk")
- optimize over this subset
- clever subset selection → efficiency
[Joachims; "Making Large-Scale SVM Learning Practical"]

- after each chunk, generate a constraint and update β [Sonnenburg, Rätsch, Schäfer; NIPS 2006]
- thereafter, SVM optimization can simply continue (similar to a "hot start")

[implemented in Shogun, http://www.shogun-toolbox.org/]
Efficiency Comparison 10 2 dashed blue: wrapper full black: SVM (single kernel) others: chunking MKL time in seconds 10 1 10 0 10 1 10 2 10 2 10 3 sample size
Blue Picture of a Cell
Input: protein sequence
Output: target location of the protein
[image taken from the internet]
List of 69 Kernels

64 = 4 × 16 motif kernels
- 4 subsequences (all, last 15, first 15, first 60)
- 16 = 2^{5-1} patterns of length 5

3 BLAST similarity kernels
1 linear kernel on E-values
2 Gaussian kernel on E-values, width 1000
3 Gaussian kernel on log E-values, width 1e5

2 phylogenetic kernels
1 linear kernel
2 Gaussian kernel, width 300

[all described in: C. S. Ong, A. Zien; WABI 2008]
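A sketch of how the Gaussian kernels listed above might be computed, assuming each protein is represented by a feature vector of (log) E-values; the actual feature construction is described in the cited WABI 2008 paper, and "width" is the σ in k(x, x') = exp(-||x - x'||² / (2σ²)). All values below are made up:

```python
import numpy as np

def gaussian_kernel(F, width):
    """F: array (N, d), one feature vector per protein. Returns the N x N kernel."""
    sq = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * width ** 2))

evalues = np.array([[1e-30], [1e-5], [0.1], [2.0]])        # toy E-values
K_evalue = gaussian_kernel(evalues, width=1000.0)           # width 1000, as listed
K_log_evalue = gaussian_kernel(np.log(evalues), width=1e5)  # width 1e5 on log E-values
```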
Better Than Previous Work 100.0 98.0 MCC [%] (plant, nonplant) F1 [%] (psort+, psort-) plant nonplant psort+ psortperformance (higher is better) 96.0 94.0 92.0 90.0 88.0 86.0 84.0 mkl avg other 82.0 80.0 MCMKL unweighted sum of kernels TargetLoc / PSORTb v2.0
Better Than Single Kernels and Than the Average Kernel

[plot: F1 score (0.2-1.0) for each of the ~69 kernels; solid line: MKL; dashed line: sum with uniform weights; bars: single kernels, grouped as 1 phylogenetic profiles, 2 BLAST similarities, 3 motifs (complete sequence), 4 motifs (last 15 AAs), 5 motifs (first 15 AAs), 6 motifs (first 60 AAs)]
Weights vs. Single-Kernel Performances
What You Should Take Home From This Talk

- SVMs are QPs, but require special solvers for efficiency
- MKL amounts to a QCQP...
- ... but formulating it as a SILP is more efficient, because it reduces to several (highly optimized) SVM trainings
- ... but it can be solved even more efficiently by interleaving SVM training with the outer optimization
- yes, we need the speed: faster → more training data → higher accuracy

Questions?