Mathematical Programming for Multiple Kernel Learning

Mathematical Programming for Multiple Kernel Learning
Alex Zien
Fraunhofer FIRST.IDA, Berlin, Germany
Friedrich Miescher Laboratory, Tübingen, Germany
07 July 2009
Mathematical Programming Stream, EURO 2009, Bonn, Germany

The Goal: Multiple Kernel Learning

Given N data items x_n ∈ X (n = 1, ..., N), with associated labels y_n ∈ {−1, +1} (n = 1, ..., N), P feature maps Φ_p : X → R^{d_p} (p = 1, ..., P), and a regularization constant 0 < C ∈ R, efficiently solve the Multiple Kernel Learning (MKL) problem

    \min_{w, b, \xi, \beta} \; \frac{1}{2} \sum_{p=1}^{P} \beta_p \|w_p\|^2 + C \langle \mathbf{1}, \xi \rangle
    \quad \text{s.t.} \quad \forall n: \; y_n \sum_{p=1}^{P} \beta_p \langle w_p, \Phi_p(x_n) \rangle \geq 1 - \xi_n,
    \quad 0 \leq \xi \in \mathbb{R}^N, \quad 0 \leq \beta \in \mathbb{R}^P, \quad \langle \mathbf{1}, \beta \rangle \leq 1.

Application: Subcellular Localization of Proteins

Input: protein sequence
Output: target location of the protein

[image from "A primer on molecular biology", in Kernel Methods in Computational Biology, MIT Press, 2004]

Outline

1 The Goal: MKL for Classification
2 Why Optimize This Problem?
  - Support Vector Machines (SVMs)
  - Non-Linearity via Kernels
  - A Large Margin MKL Model
3 Optimization for MKL
4 Application: Predicting Protein Subcellular Localization
5 Take Home Messages

Support Vector Machine (SVM): Maximum Margin

SVM: maximum margin classifier

[figure: separating hyperplane \langle w, x \rangle + b = 0 with margin hyperplanes \langle w, x \rangle + b = +1 and \langle w, x \rangle + b = -1, and normal vector w]

    \min_{w, b} \; \underbrace{\tfrac{1}{2} \langle w, w \rangle}_{\text{regularizer}}
    \quad \text{s.t.} \quad \underbrace{y_i \big( \langle w, x_i \rangle + b \big) \geq 1}_{\text{data fitting}}
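The link between the regularizer and the margin can be made explicit (a standard one-line derivation, not part of the slide): for points x_+ and x_- lying on the two margin hyperplanes,

    \langle w, x_+ \rangle + b = +1, \qquad \langle w, x_- \rangle + b = -1
    \;\Rightarrow\; \langle w, x_+ - x_- \rangle = 2
    \;\Rightarrow\; \text{margin} = \Big\langle \tfrac{w}{\|w\|}, \, x_+ - x_- \Big\rangle = \frac{2}{\|w\|},

so minimizing \tfrac{1}{2} \langle w, w \rangle maximizes the margin.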

Support Vector Machine: Dual Is a QP

Primal:

    \min_{w, b, (\xi_i)} \; \tfrac{1}{2} \langle w, w \rangle + C \sum_i \xi_i
    \quad \text{s.t.} \quad y_i \big( \langle w, x_i \rangle + b \big) \geq 1 - \xi_i, \quad \xi_i \geq 0

At the optimum: \xi_i = l(x_i, y_i) := \max \{ 1 - y_i ( \langle w, x_i \rangle + b ), \, 0 \}

Dual:

    \min_{\alpha} \; \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i
    \quad \text{s.t.} \quad 0 \leq \alpha \leq C \mathbf{1}, \quad \langle y, \alpha \rangle = 0

Only needs pairwise dot products!
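Since the dual is a plain QP over α, it can be handed to any convex solver. Below is a minimal sketch using CVXPY on made-up toy data (the data, the small ridge term, and the solver choice are illustrative assumptions, not from the talk):

    import numpy as np
    import cvxpy as cp

    # toy, linearly separable 2D data (hypothetical)
    X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    C = 1.0

    # only pairwise dot products are needed; a small ridge keeps K numerically PSD
    K = X @ X.T + 1e-8 * np.eye(len(y))

    alpha = cp.Variable(len(y))
    objective = 0.5 * cp.quad_form(cp.multiply(y, alpha), K) - cp.sum(alpha)
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    cp.Problem(cp.Minimize(objective), constraints).solve()

    w = X.T @ (alpha.value * y)   # recover the primal weight vector from the dual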

Non-Linear Mappings

Example: all degree-2 monomials for a 2D input

    \Phi : \mathbb{R}^2 \to \mathbb{R}^3 =: \mathcal{H} \ (\text{"feature space"})
    \qquad (x_1, x_2) \mapsto (z_1, z_2, z_3) := \big( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \big)

[figure: data shown in input coordinates (x_1, x_2) and in feature-space coordinates (z_1, z_2, z_3)]

Kernel Trick

Example: all degree-2 monomials for a 2D input

    \langle \Phi(x), \Phi(x') \rangle
      = \big\langle (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2), \; (x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2) \big\rangle
      = x_1^2 x_1'^2 + 2\, x_1 x_2 x_1' x_2' + x_2^2 x_2'^2
      = \big( x_1 x_1' + x_2 x_2' \big)^2
      = \langle x, x' \rangle^2 =: k(x, x')

=> the dot product in H can be computed in R².

[more information: Learning with Kernels; Schölkopf, Smola; MIT Press]
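The identity is easy to verify numerically; a tiny sketch (inputs are made up):

    import numpy as np

    def phi(x):
        # explicit degree-2 monomial feature map
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    lhs = phi(x) @ phi(xp)    # dot product computed explicitly in feature space H
    rhs = (x @ xp) ** 2       # kernel k(x, x') computed directly in R^2
    print(lhs, rhs)           # both print 1.0 for these inputs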

Relevant SVM Facts

- dual is a QP
- support vectors: sparse solution
- efficient optimization
- non-linear via kernels

So how to choose the best kernel function?

A Multiple Kernel Learning (MKL) Model

MKL model: weighted linear mixture of P feature spaces

    \mathcal{H}_\gamma = \gamma_1 \mathcal{H}_1 \oplus \gamma_2 \mathcal{H}_2 \oplus \dots \oplus \gamma_P \mathcal{H}_P

    \Phi_\gamma(x) := \big( \gamma_p \Phi_p(x) \big)_{p=1,\dots,P}
    \qquad
    w_\gamma := \big( \gamma_p w_p \big)_{p=1,\dots,P}

    k_\gamma(x, x') := \sum_{p=1}^{P} \langle \gamma_p \Phi_p(x), \gamma_p \Phi_p(x') \rangle
                     = \sum_{p=1}^{P} \gamma_p^2 \, k_p(x, x')

    f_\gamma(x) := \langle w_\gamma, \Phi_\gamma(x) \rangle
                 = \sum_{p=1}^{P} \gamma_p^2 \, \langle w_p, \Phi_p(x) \rangle

Goal: learn the mixing coefficients γ = (γ_p)_{p=1,...,P} along with w, b.
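In terms of precomputed kernel matrices, the mixed kernel is just a weighted sum; a small sketch (the function name and toy kernels are illustrative):

    import numpy as np

    def combined_kernel(kernels, gamma):
        # kernels: list of P (N x N) kernel matrices; gamma: length-P mixing coefficients
        return sum(g**2 * K for g, K in zip(gamma, kernels))

    K1, K2 = np.eye(3), np.ones((3, 3))                         # two toy base kernels
    K_gamma = combined_kernel([K1, K2], np.array([0.8, 0.6]))   # = 0.64*K1 + 0.36*K2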

Large Margin MKL

Plugging it into the SVM

    \min_{w, b, \xi, \gamma} \; \tfrac{1}{2} \|w_\gamma\|^2 + C \sum_{i=1}^{N} \xi_i
    \quad \text{s.t.} \quad \forall i: \; \xi_i = l\big( \langle w_\gamma, \Phi_\gamma(x_i) \rangle + b, \; y_i \big)

yields:

    \min_{w, b, \xi, \gamma} \; \tfrac{1}{2} \sum_{p=1}^{P} \gamma_p^2 \|w_p\|^2 + C \sum_{i=1}^{N} \xi_i
    \quad \text{s.t.} \quad \forall i: \; \xi_i = l\Big( \sum_{p=1}^{P} \gamma_p^2 \langle w_p, \Phi_p(x_i) \rangle + b, \; y_i \Big)

For convenience we substitute β_p := γ_p².

MKL: Retain Convexity and Feasibility

    \min_{\beta, w, b, \xi} \; \tfrac{1}{2} \sum_{p=1}^{P} \beta_p \|w_p\|^2 + C \sum_{i=1}^{N} \xi_i
    \quad \text{s.t.} \quad \forall i: \; \xi_i = l\Big( \sum_{p=1}^{P} \beta_p \langle w_p, \Phi_p(x_i) \rangle, \; y_i \Big)

Problems: (i) the products β_p w_p are non-convex; (ii) β is unbounded (β → +∞).

Solutions: (i) change of variables v_p := β_p w_p; (ii) constrain β:

    \min_{\beta, v, b, \xi} \; \tfrac{1}{2} \sum_{p=1}^{P} \frac{1}{\beta_p} \|v_p\|^2 + C \sum_{i=1}^{N} \xi_i
    \quad \text{s.t.} \quad \forall i: \; \xi_i = l\Big( \sum_{p=1}^{P} \langle v_p, \Phi_p(x_i) \rangle, \; y_i \Big),
    \quad \langle \beta, \mathbf{1} \rangle \leq 1

[Zien, Ong; ICML 2007] [Rakotomamonjy, Bach et al.; ICML 2007]

MKL Lagrange Dual

For the hinge loss, l(t, y) := max{0, 1 − yt}:

    \min_{\gamma, \alpha} \; \gamma - \sum_i \alpha_i
    \quad \text{s.t.} \quad \forall i: \; 0 \leq \alpha_i \leq C, \quad \sum_i y_i \alpha_i = 0,
    \quad \forall p: \; \gamma \geq \tfrac{1}{2} \sum_i \sum_j y_i y_j \alpha_i \alpha_j \langle \Phi_p(x_i), \Phi_p(x_j) \rangle

- quadratically constrained quadratic program (QCQP)
- for a single kernel, reduces to the standard SVM
- may be solved by off-the-shelf solvers (e.g. CPLEX)

[Zien and Ong; ICML 2007]
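Since each constraint is a convex quadratic in α (the kernel matrices are PSD), the whole QCQP can also be written directly in a modelling language. A minimal CVXPY sketch with a made-up interface (the talk instead mentions off-the-shelf solvers such as CPLEX):

    import numpy as np
    import cvxpy as cp

    def mkl_qcqp(kernels, y, C=1.0):
        # kernels: list of P (N x N) PSD kernel matrices (add a tiny ridge if needed); y in {-1, +1}^N
        N = len(y)
        alpha, gamma = cp.Variable(N), cp.Variable()
        constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
        for K in kernels:
            # one quadratic constraint per kernel p
            constraints.append(0.5 * cp.quad_form(cp.multiply(y, alpha), K) <= gamma)
        cp.Problem(cp.Minimize(gamma - cp.sum(alpha)), constraints).solve()
        return alpha.value, gamma.value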

MKL as SILP

Semi-Infinite Linear Program (SILP):

    \max_{\beta, \theta} \; \theta
    \quad \text{s.t.} \quad \sum_p \beta_p = 1, \quad \beta_p \geq 0,
    \quad \forall \alpha \in S: \; \theta \leq \tfrac{1}{2} \sum_p \beta_p \|w_p(\alpha)\|^2 - \sum_i \alpha_i

with w_p(α) := Σ_i y_i α_i Φ_p(x_i).

Infinite set of constraints indexed by:

    S := \Big\{ \alpha \;\Big|\; \forall i: \; 0 \leq \alpha_i \leq C, \; \sum_i y_i \alpha_i = 0 \Big\}

[Sonnenburg, Rätsch, Schäfer; NIPS 2006]

MKL Wrapper by Column Generation (1)

1 initialize the LP with a minimal set of constraints: Σ_p β_p = 1, β_p ≥ 0
2 initialize β to a feasible value (e.g. β_p = 1/P)
3 iterate:
  - for the given β, find the most violated constraint:

        \min_{\alpha} \; \tfrac{1}{2} \sum_p \beta_p \|w_p(\alpha)\|^2 - \sum_i \alpha_i
        \quad \text{s.t.} \quad \alpha \in S

    i.e. solve a single-kernel SVM!
  - add this constraint to the LP
  - solve the LP to obtain new mixing coefficients β

=> just need a wrapper around a single-kernel method
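A compact sketch of this wrapper in Python, using scikit-learn's SVC with a precomputed kernel as the single-kernel solver and scipy's linprog for the LP (the stopping rule and all names are simplifying assumptions, not the exact procedure of Sonnenburg et al.):

    import numpy as np
    from scipy.optimize import linprog
    from sklearn.svm import SVC

    def silp_mkl(kernels, y, C=1.0, n_iter=50, tol=1e-4):
        # kernels: list of P (N x N) kernel matrices; y in {-1, +1}^N
        P, N = len(kernels), len(y)
        beta = np.full(P, 1.0 / P)            # feasible starting point
        rows, rhs = [], []                    # accumulated LP constraints
        theta = -np.inf
        for _ in range(n_iter):
            # most violated constraint: single-kernel SVM on the mixed kernel
            K_beta = sum(b * K for b, K in zip(beta, kernels))
            svm = SVC(C=C, kernel="precomputed").fit(K_beta, y)
            alpha = np.zeros(N)
            alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())
            ya = alpha * y
            T = np.array([0.5 * ya @ K @ ya for K in kernels])   # 0.5 * ||w_p(alpha)||^2
            if theta > -np.inf and beta @ T - alpha.sum() >= theta - tol:
                break                         # new constraint (almost) satisfied: done
            # add the constraint  theta - sum_p beta_p * T_p <= -sum_i alpha_i  to the LP
            rows.append(np.concatenate(([1.0], -T)))
            rhs.append(-alpha.sum())
            # LP over (theta, beta): maximize theta s.t. sum_p beta_p = 1, beta >= 0
            res = linprog(c=np.concatenate(([-1.0], np.zeros(P))),
                          A_ub=np.array(rows), b_ub=np.array(rhs),
                          A_eq=np.array([[0.0] + [1.0] * P]), b_eq=[1.0],
                          bounds=[(None, None)] + [(0, None)] * P)
            theta, beta = res.x[0], res.x[1:]
        return beta

Each iteration touches the base kernels only through the mixed matrix K_beta and the quadratic forms T_p, which is why reusing a highly optimized single-kernel SVM solver pays off.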

MKL Wrapper by Column Generation (2)

Alternate between solving an LP for β and a QP for α.

Efficiency Comparison

The SILP-based wrapper is much more efficient than the QCQP.
Reason: it exploits the efficiency of existing SVM solvers.
The same holds for the projected-gradient approach. [Rakotomamonjy, Bach et al.; ICML 2007]

Idea to gain still more efficiency:
- each constraint (gradient) requires a full SVM training
- although most constraints are auxiliary anyway, i.e. they are based on suboptimal β
- could use approximate constraints, i.e. based on approximately trained SVMs

Chunking Approach

Idea: interleave the β-update with SVM chunking.

SVM chunking: repeatedly optimize a subproblem
- select a subset of the α variables (a "chunk")
- optimize over this subset
- clever subset selection => efficiency
[Joachims; "Making Large-Scale SVM Learning Practical"]

- after each chunk, generate a constraint and update β [Sonnenburg, Rätsch, Schäfer; NIPS 2006]
- thereafter, SVM optimization can simply continue (similar to a "hot start")

[implemented in Shogun, http://www.shogun-toolbox.org/]

Efficiency Comparison 10 2 dashed blue: wrapper full black: SVM (single kernel) others: chunking MKL time in seconds 10 1 10 0 10 1 10 2 10 2 10 3 sample size

Blue Picture of a Cell

Input: protein sequence
Output: target location of the protein

[image taken from the internet]

List of 69 Kernels

- 64 = 4 × 16 motif kernels
  - 4 subsequences (all, last 15, first 15, first 60)
  - 16 = 2^(5−1) patterns of length 5
- 3 BLAST similarity kernels
  1 linear kernel on E-values
  2 Gaussian kernel on E-values, width 1000
  3 Gaussian kernel on log E-values, width 1e5
- 2 phylogenetic kernels
  1 linear kernel
  2 Gaussian kernel, width 300

[all described in: C. S. Ong, A. Zien; WABI 2008]

Better Than Previous Work

[figure: bar chart of performance (higher is better, roughly 80-100%) on four datasets (plant, nonplant, psort+, psort-); metric is MCC [%] for plant/nonplant and F1 [%] for psort+/psort-; bars compare MCMKL ("mkl"), the unweighted sum of kernels ("avg"), and TargetLoc / PSORTb v2.0 ("other")]

Better Than Single Kernels and Than the Average Kernel

[figure: F1 score for each of the 69 single kernels (bars), compared with MKL (solid line) and the sum of kernels with uniform weights (dashed line); kernel groups: 1 phylogenetic profiles, 2 BLAST similarities, 3 motifs on the complete sequence, 4 motifs on the last 15 AAs, 5 motifs on the first 15 AAs, 6 motifs on the first 60 AAs]

Weights vs. Single-Kernel Performances

What You Should Take Home From This Talk

- SVMs are QPs, but require special solvers for efficiency
- MKL amounts to a QCQP ...
- ... but formulating it as a SILP is more efficient, because it reduces to several (highly optimized) SVMs
- ... but it can be solved even more efficiently by interleaving SVM training with the outer optimization
- yes, we need the speed: faster => more training data => higher accuracy

Questions?